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Analysis and Pattern Recognition in Large, Multidimensional 
Data Sets Using Low-Resolution Data Grouping 

A portion of the disclosure of this patent docximent contains material 
which is subject to copyright protection. The copyright owner has no objection to 
the facsimile reproduction by anyone of the patent document or the patent 
disclosure, as it appears in the Patent and Trademark Office patent file or records, 
but otherwise reserves all copyright rights whatsoever. 

1. Field of the Invention 

This invention relates to analysis and pattern recognition of data. More 
particularly, this invention relates to methods, systems and devices and 
combinations thereof for analysis and pattern recognition in large sets of 
multidimensional data using low-resolution data grouping, 

2. Background 

With the advent of computerization and the low cost of data storage and 
acquisition, people in many endeavors are now accumulating very large sets of 
data. For example, scientists in drug and chemical companies now use automation 
to perform so-called high-throughput screening ("HTS") of chemical compoimds. 
HTS uses automated, relatively low-cost techniques to obtain various items of 
information about chemical compounds. The goal of using HTS is to obtain 
information about a very large number of compounds in a quick and relatively 
low-cost manner. Having accumulated a very large HTS data set, it is necessary 
to evaluate the data in order to determine which, if any, of the analyzed 
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compounds warrants further investigation. However, the results of such HTS tend 
to be very large sets of multidimensional data, on the order of thousands of rows 
and dozens of columns, and so it is very difficult to make decisions just by 
looking at the data. 

In addition to the very large amounts of data produced by HTS, difficulties 
in existing data handling and analysis methods include the following: 

• Data comes from very diverse sources, including HTS laboratories, 
physical measurements, bio-scientists* laboratories, various computational 
software programs, etc., and the different sources tend to have very diverse 
kinds of output including numbers, text, mixed data types, error notations, 
blank data, replicate data (more than one value per compound), etc. 

• Not all sources produce data on the same list of compounds, or in the same 
order. 

• Some data values are misleadingly too precise, i.e., have high relative 
experimental errors or noise, and can easily be over-interpreted. 

• Medicinal chemists have to weigh very different kinds of factors (for 
example, molecular weight vs. dose-responsiveness vs. ClogP vs. 
secondary biology vs. selectivity across screens) in trying to determine 
which are the best compoimds or clusters of compounds to which to 
devote further work. 
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Summary of the Invention 

This invention solves the above and other problems by providing 
automated tools to help with and speed up these data handling and analysis 
processes. These tools embody some assimiptions about how the data should be 
treated by internalizing the most generally acceptable assumptions, but leaving 
more idiosyncratic decisions to individual users. 

A central concept on which this invention is based is grouping data into a 
relatively small number of categories using low-resolution data grouping. The 
grouping is visualized by assigning colors to data groups, e.g., in spreadsheets. 
Grouping of data potentially changes the precision of the data. 

This categorization of data has several major benefits, including: 

• creating a visual means of finding data patterns; 

• beneficially blurring small variations in numerical data that are, in 
practice, excessively fine distinctions, possibly due to experimental 
error; £ind 

• providing, in the colors themselves, a means or "common currency" to 
evaluate candidates across a wide range of data types. 

Accordingly, in one aspect, this invention provides mechanisms to 
expedite pattern recognition in large sets of multidimensional data, such as those 
that chemists assemble when evaluating hits from high-throughput screening 
(HTS) and deciding which ones vsnll get priority for fiirther investigation. In 
controlled trials, this invention has reduced the time to evaluate real data sets, 
firom days of intense human effort, which is vulnerable to errors due to volume or 
fatigue, to a few minutes of automation with graphical presentation of results. 
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It quickly becomes obvious upon using the system that the tools also have 
value in data-handling areas other than HTS. Examples include selection and 
management of any kind of tabulated data, e.g., portfolio management for any 
kind of rated portfolios, selection of drug candidate compounds, selection and 
management of proteins that are candidates for targets for drugs, selection and 
management of research projects competing for resources, and evaluating 
employee performance or job candidates. 

The system of this invention includes a new special command menu, a set 
of graphical user interface worksheets, and action buttons to facilitate the coloring 
and color analysis processes for the user. While the central process is the data 
grouping and coloring, there are also new tools for the upstream, or pre-grouping 
and coloring processes of importing, assembling, regularizing, and characterizing 
data in a spreadsheet, and for the downstream processes of visualizing, scoring, 
comparing, and sorting large amounts of color-coded data. The data-grouping and 
spreadsheet-coloring tool is presently implemented with a flexible, powerful, and 
convenient user interface that does not require knowledge of spreadsheet macros 
or of the Visual Basic language (used for the system's implementation). 

Accordingly, this invention provides methods, systems and devices for 
operating on data. 

In one aspect, the method of this invention provides at least one user- 
defined grouping rule for grouping the data into a user-definable number of 
groups. At least one of the grouping rules is applied to the data. The data may be 
provided in a table and the grouping rule applies to at least one user-selectable 
column of the table. In some embodiments, the grouping rule defines breakpoints 
corresponding to the user-definable number of groups. Application of the rule the 
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data divides the data into groups based on the breakpoints. The method may 
include presenting the grouped data in a manner that visually distinguishes the 
groups. In some embodiments, the grouping rules associate colors with groups 
and the grouped data is presented with an aspect of the data colored according to 
the rules. 

Sometimes the data are in labeled columns in a spreadsheet, and the 
grouping rule specifies at least one breakpoint and a corresponding color for each 
range defined by the breakpoint. The grouped data are presented by coloring each 
data item in one labeled column of the data based on the breakpoint and the 
corresponding color of the breakpoint. 

The breakpoints may be numeric or textual values. In some embodiments, 
the breakpoint is determined automatically based on the data. 

Sometimes the data are provided in a table, and backgrounds of table cells 
are colored according to the rules. 

The number of groups may be fewer than a number of possible data values. 

In another aspect, this invention is a method of operating on data by 
providing at least one user-defined grouping rule for grouping the data into a user- 
definable niunber of groups. The grouping rule is applied to the data to generate 
grouped data. At least one user-defined scoring rule is used to score grouped data 
according to user-defined scores. The scoring rule is applied to the grouped data 
to score the grouped data. 

In yet another aspect, this invention is a method of operating on data, in 
which data are grouped by applying to the data at lesist one user-defined grouping 
rule for grouping the data into a user-definable number of groups. The grouped 
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data are scored by applying to the grouped data at least one user-defined scoring 
rule for scoring the grouped data according to user-defined scores. 

In some embodiments the data can be a number of parameters for each of a 
number of cases and the scoring rule comprises a scoring function of user- 
selectable parameters and user-defined weights for the selected parameters to be 
used in scoring the cases. The scoring applies the fiinction to the data to obtain a 
score for each case. Sometimes the method includes sorting the scored cases by 
score, individually or by cluster, as described below. 

The notion of clustering is that subsets of the various cases may be 
associated into clusters by having identical entries in any user-selected column of 
data, known as a clustering colunm. In some embodiments of the invention, the 
integrated clusters are treated by averaging the properties of all the cases which 
comprise each cluster. 

Thus, according to aspects of this invention, in order to facilitate analysis 
and pattern recognition in large, multidimensional data sets, the precision of the 
data is potentially changed (implemented, e.g., by grouping the data) and then the 
data are presented for visualization (implemented, e.g., by coloring the data). 

Brief Description of the Drawings 

This file contains at least one dravsdng executed in color. Copies of this 
patent with color drawings will be provided by the United States Patent and 
Trademark Office upon request and payment of the necessary fee. 

The above and other objects and advantages of the invention will be 
apparent upon consideration of the following detailed description, taken in 
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conjunction with the accompanying drawings, in which the reference characters 
refer to like parts throughout and in wiiich: 

Figure 1 shows a typical computer system on which the present invention 
operates; 

Figure 2 shows an overview of the functionality of the present invention; 
Figures 3A-3B depict a display of data in a spreadsheet; 
Figures 4A-4B show a color control rules worksheet according to one 
embodiment of the present invention; 

Figures 5A-5B show data coloring rules; 

Figures 6A-6C show a data coloring control panel and a flow chart of the 
data coloring process, respectively; 

Figures 7A-8B show the worksheet of Figure 3 A and 3B after various 
coloring rules in Figure 4A have been applied; 

Figures 9A, 9B, IOA, and lOB depict displays of data in spreadsheets; 

Figures 11A and IIB show the form of the cluster control worksheet 
according to one embodiment of the present invention; 

Figures IIC-IID shows control panels from the cluster control 
worksheet of FIGURES IIA-IIB; 

Figure 12 shows the enlarging of the cluster starts mechanism according 
to one embodiment of the present invention; 

Figures 13A-13D show the application of vertical display re-scaling 
according to one embodiment of the present invention; 

Figures 14A-14D and 15A-15B show the appUcation of the scoring and 
sorting of clusters according to one embodiment of the present invention; 
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Figures 16A-16N, 16P and 16Q show aspects of the application of the 
dose-response scoring and estimation of potencies according to one embodiment 
of the present invention; 

Figures 17A-17B show the appHcation of the sheet statistics tool 
according to one embodiment of the present invention; 

Figures 18A-18D show the application of the scoring and sorting of 
clusters for the purpose of project prioritization and management according to one 
embodiment of the present invention; 

Figures 19-24 show examples of the application of this invention to 
various types of data; and 

Figures 25 and 26 show application of an aspect of this invention. 

Detailed Description of the Presently Preferred Exemplary 

Embodiments 

Overview 

Figure 1 shows a typical computer system 100 on which the present 
invention operates. The computer system 100 includes a processor (CPU) 102 
connected to a memory system 104 and a display 106. The computer system also 
includes various input devices including a keyboard 108 and a mouse 110 or other 
pointing device. Internal storage 112 (e.g., a hard disk, a CD ROM and the like) 
and external storage 114 (such as a floppy disk, CD ROM and the like) are also 
provided. 

Various aspects of this invention are implemented as computer software 
programs or algorithms 116 which run on the computer system 100. The software 
programs 116 can reside in the internal storage 112, the external storage 114, 
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and/or in the memory 104. The software programs 116 operate on data 118 which 
is provided, e.g., on the external storage 114. The software programs 116 operate 
in a standard and known manner by being executed on the processor 102 of the 
computer system 100. 

In some embodiments of the present invention, the user can create and 
modify various executable rules 120 which can operate on the data 118. For the 
sake only of explanation, the rules 120 are depicted separately from the data in the 
figures. However, as explained in more detail below, some or all of the rules 120 
can be part of the data 118. 

In preferred embodiments, the computer system 100 is capable of running 
the spreadsheet program Excel™ 95 (hereinafter "Excel") from Microsoft 
Corporation, and the software computer programs 116 are written in Microsoft 
Corporation's Visual Basic (hereinafter "VB") and are provided as an add-in to 
Excel. A single copy of software thus serves all data files on a particular 
machine. To conserve EXCEL resources, in some embodiments, the package self- 
installs the add-in when the user opens a data file, and un-installs the add-in when 
the last data file in memory is closed. 

In a preferred embodiment, this invention works entirely within the 
environment of Excel. Excel structures data files as workbook files which 
contain worksheets. The programs 116 of this invention consist of special Excel 
worksheets, called control sheets, on which input data is written by the user into 
designated labeled cells. The control sheets are part of the same EXCEL workbook 
file as the data. The control sheets also contain action buttons to execute the 
various procedures associated with this invention. The rules 120 are formed by 
setting various parameters in the control sheets. 
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When the workbook file is saved, the parameters (for the rules 120) are 
stored on the control sheets along with the data, and they can be modified and/or 
re-executed at any time without having to re-enter anything. The results of 
operations are automatically written as worksheets in the same workbook file, 
providing a convenient, integrated data environment in a single file. 

The system according to the present invention operates, in one aspect, in 
accordance with FIGURE 2. Recall that the user's aim is to perform analysis and 
pattern recognition in large, multidimensional data sets using (potentially low 
resolution) data grouping. To this end, the user and/or the system will create rules 
for coloring and presenting the data. First (at 122) a user creates and organizes the 
data 118. Various tools (discussed below) are provided to aid in the creation and 
organization of the data. Then (at 124) the user creates rules 120 for operating on 
the data 118. The rules 120 can be created before or after the data 118, rules can 
be reused for different sets of data and multiple rules can apply to the same data. 
The creation and operation of rules are discussed in greater detail below. Once the 
data 118 and the rules 120 are created, the user then selects some (or all) of the 
rules to apply to the data (at 126). Specifically, the user groups and thereby colors 
the data according to selected rules. With the data grouped and colored according 
to the rules, the user can then perform group/color-mediated data mining (at 128). 

Figures 3A-3B show views of the program of this invention in operation 

with a sample Excel sheet 300, denoted "DEMO 1" (302) containing data (not all 

the data in the sheet is visible). The views of EXCEL worksheets shown in the 

various figures and examples that follow are the views that are presented on the 

display 106 of the computer system 100. Sheets in an EXCEL workbook are 

labeled with tabs at the bottom of the worksheet. The data on the "DEMOl" sheet 

10 
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300 consists of eight columns of data for each of a number of compounds. The 
compounds are denoted "Cmpdxx", where "xx" ranges from "01" to the number of 
compounds. In Figure 3B, the last compound visible on the data sheet is 
"Cmpd58". The eight columns are headed: 

1. "Cmpd" (colunm A); 

2. "Series" (column B); 

3. "Testl" (column C); 

4. "Test2" (column D); 

5. "Test3" (column E); 

6. "HTS SPA Dose-Resp % Inhib @3xlO-6M" (column F); 

7. "HTS SPA Dose-Resp % Inhib @ 1 x 1 0-6M" (colunm G); 

8. "HTS SPA Dose-Resp % Inhib @ 3xlO-7M" (column H); and 

9. "HTS SPA Dose-Resp % Inhib @ Ixl0-7M" (column I). 

In addition to the "DEMO 1" worksheet 300, the Excel workbook shown 
in Figures 3A and 3B has seven other worksheets, denoted "DEMO 2" 304; 
"DEMO 3" 306; "clusterinfo DEMO" 308; "Append Control" 310; "Color 
Control" 312 and "Cluster Control" 314. The last three worksheets, denoted 
respectively "Append Control"; "Color Control" and "Cluster Control," contain 
various rules and controls (to be discussed below). The data in worksheets 
denoted "DEMO 1" 302; "DEMO 2" 304; "DEMO 3" 306; and "clusterinfo 
DEMO" 308 correspond to data 1 18 (Figure 1) and the controls or rules in the 
worksheets denoted "Append Control" 310; "Color Control" 312, and "Cluster 
Control" correspond to the rules 120 (Figure 1). 

Figures 4A-4B show a color control rules worksheet (312, denoted "color 

control") according to the present invention, as displayed on display 106 of the 

11 



wo 01/08039 PCT/USOO/20401 
computer system 100. The color control worksheet 312 is shown with some rules 
already in place, i.e., having values set, and other rules left blank. These rules are 
shown as examples only, and, as with any of the other types of rules, any or all of 
the rules can be set by the user. A typical data coloring rule 130 is shown in 
Figure 5A. The rule 130 has already been set up and operates on the appropriate 
data when selected by a user (using mouse 110, FIGURE 1 or some other pointing 
device) in the area 132 marked "Click here to run these". The rule 130 (as with all 
of the preferred color control rules) has four parts, namely the name of the sheet 
134 containing the data on which the rule is to operate ("DEMO 1" in the example 
of Figure 5 A); the columns 136 of data of the sheet on which the rule is to 
operate ("E" in the example of Figure 5A); the number of colors 138 to be used 
by the rule 130; a number of breakpoints 140 (denoted "break 1" to "break 4" in 
the example of Figure 5A); and a corresponding number of colors 142 for each 
range defined by the breakpoints (denoted "color 1" to color 4" in the example of 
Figure 5A). Actually, as explained below, the number of breakpoints is one less 
than the number of colors.. In the specific example shown in Figure 5 A, the rule 
has three breakpoints of 1 , 5 and 10, defining four ranges vdth four corresponding 
colors 142, namely light green, yellow, orange and red. Preferably the named 
colors are also depicted in the actual colors, so that, in this example, the 
background of the word "lightgreen" is shown in light green, the background of 
the word "yellow" is shown in yellow and so on. 

In this invention it is preferable to show data and meta-data (headings etc.) 
in color. In some embodiments, the coloring is implemented by showing a 
background area of the text representing the data in the appropriate color. 

Sometimes the actual text representing the data is shown in the appropriate color. 

12 
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In presently preferred embodiments, the font color is only changed in cases where 
necessary to improve contrast with the background color for readability. Only two 
font colors, dark (black) and light (pale gray), are used in the presently preferred 
embodiment. Combinations of both approaches can be used. For example, the 
background section of the word "yellow" is preferably shown in the color yellow. 
It is also possible to show the word itself, i.e., the font, in the color yellow, as long 
as that color is distinguishable from the background. 

The particular rule 130 shown in Figures 4A and 5 A, operates as follows, 
when selected: 

In sheet "DEMO 1" 302, in column E, values less than or equal to 1 (break 
1) are colored light green (color 1); values in the range 1 to 5 (between break 1 
and break 2) are colored yellow (color 2); values in the range 5 to 10 (break 2 to 
break 3) are colored orange (color 3); and values greater than 10 (break 3) are 
colored red (color 4). 

Another typical data coloring rule 130-1 from the color control sheet 312 
is shown in FIGURE 5B. The rule 130-1 is set up to operate on columns "C" and 
"D" of sheet "DEMO 1". The rule 130-1 uses three (3) breakpoints (break 1=0.1, 
break2=l and break3=5) defining four ranges with four (4) corresponding colors 
("lightgreen", "yellow", "orange", and "red"). 

The results of applying the rule 130-1 of Figure 5B to the data in sheet 

"DEMO 1" (302, Figure 3) are shown in Figures 7A-7B. As can be seen from 

Figures 7A-7B, after application of the rule 130-1, all of the data in columns C 

and D of the sheet labeled "DEMO 1" has been colored according to the rule. 

Specifically, data having a value less than or equal to break 1 (0.1) have been 

colored light green; data values in the range between break 1 and break 2 (0.1 to 

13 
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1) have been colored yellow; data values in the range between break 2 and break 3 
(1 to 5) have been colored orange; and data values greater than break 3 (5) have 
been colored red. 

The results of applying all of the other color control rules shown in 
Figures 4A-4B to the data in sheet "DEMO 1" are shown in FIGURES 8A-8B. 
The rules can be applied individually (as shown above with respect to 
Figures 7A-7B), or they can be all be applied at the same time. In order to apply 
all rules to a particular data set (sheet), each rule can be individually selected or 
the area labeled "RE-RUN ALL rules for SHEET NANfED DEMO 1" (on the right 
side of Figure 4A) can be selected. Note that if two rules apply to the same 
column of the same sheet, the second rule run on that column will override the 
first rule run on that column. 

To create a coloring rule a user performs the following (with reference to 
Figure 6B): 

(1) Select the "COLOR CONTROL" sheet 312 and pick a control panel on 
that sheet to use (an empty panel or one containing a rule no longer 
needed) (at 600). All control panels on a sheet can be cleared by 
clicking the button labeled "Clear all entries on this sheet" 
(318 in Figure 4A). 

(2) In the selected control panel, enter the name of the sheet to be 
colored (at 602). 

(3) In the selected control panel, enter a column or columns (at 604). 

For multiple columns, either list them separated by commas, or use 

a colon or hyphen to denote ranges, or some combination. For 

example, "A:D,F" means columns A,B,C,D, and F. To aid in 

14 
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choosing columns, the user can right-click on the cell containing 
the name of the data sheet, and pick "Open Twin Screen" from the 
shortcut menu that appears, to create a special dual display. This 
also creates a "Close Twin Screen" button to go back. 

(4) Choose a number of colors to use (at 606), either by entering the 
number of colors or by repeatedly clicking the gray button adjacent 
the cell labeled "# of colors". In preferred embodiments, the 
system allows for five breakpoints and six colors per rule. 
Accordingly, the numbers will cycle from 1 to 6, and various cells 
below them will be blacked out accordingly. 

(5) Enter the breakpoints that define the color groups (at 608), in any 
of three modes: 

a) Numeric data, manual mode : enter numbers to form 
the breakpoints, i.e., the bovmdaries between the 
color groups, one less than the number of colors, in 
increasing numerical order. Cells whose values 
exactly equal a breakpoint value will be colored 
with the lower group (breakpoint 1 is colored with 
color 1, etc.) 

b) Numeric data, automatic mode ', enter either "value", 

"log", or "count'* as the first breakpoint. If multiple 

coltimns have been chosen, the user must also enter 

"yes" or "no" opposite "Re-scale all?" at the bottom 

of the panel, to indicate whether each column 

should get its own auto-breakpoints, or whether the 
15 
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auto-breakpoints of the first column (first in list in 
the rule, not first on the data sheet) should be used 
for all. 

This mode reports information about the 
5 breakpoints it determines, and thus could also be 

used to explore the distribution of numerical values 
in a column prior to a final manual breakpoint 
selection. 

c) Text data : enter the strings to be matched and 
10 colored, in preferred embodiments, up to five (5) in 

number. Matching is case-insensitive unless the 
string is enclosed in double quotes (" and 
otherwise, no quotation marks are necessary. 
Several special text strings act as operators if entered as 
15 the first word in a rule cell: 



Rule Entry 

(OPERATORS need not 
be uppercase — here only 
for emphasis) 


Meanine 


test string 


color data cell if its whole content matches the test string 


NOT test string 


• color data cell if its whole content does not match the 
test string; 

• will not color numeric cells 


CONTAINS test string 


color data cell if contains the test string as a substring 
anywhere 


NOTCONTAINS test 
string 


• color data cell if it does not contain the test string 
anywhere; 

• will not color numeric cells 
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Rule Entry 



(OPERATORS need not 
be uppercase — here only 
for emphasis) 



Meaning 



BEGINS test string 



color data cell if it begins with the test string 



ENDS test string 



color data ceil if it ends with the test string 



* (an asterisk) 



(wildcard) color data cell containing any data, including 



numeric cells 



BLANK 



color data cells that are blank 



Using quotes to force matching to be case-sensitive also 
works with strings that follow an operator. 

It is possible to construct a text-coloring rule in 
which certain cells may satisfy more than one of the 
"breakpoint" values. For example, if a rule says that 
"active" is colored green and "contains act" is 
colored red, then the word "active" in a cell would 
satisfy both. In such cases, the system colors the 
cell according to tbs first condition satisfied on the 
list of breakpoints. This dependence on the 

order can be used advantageously to achieve 
complex coloring conditions. The sequence of 
conditions can be considered as a series of filters, 
through which only the as-yet-imcolored cells fall 
through to the next decision. 



17 



wo 01/08039 PCT/USOO/20401 

(6) Enter the names of the colors to use (at 610), in the order 
corresponding to the breakpoints. A display of color samples is 
provided at the right side of the Color Control sheet 312. A user 
need only enter the name, and the appropriate cell will become 
colored when the tool is executed. If the user wants the color to 
display immediately, he can copy and paste the sample cell into the 
rule's color cell. A special pseudo-color named "SKIP" is used to 
tell the system not to color the cells whose data falls in this group. 

(7) When the rule has been created, the user executes the rule by 
selecting the rule's "Click here to run these" button on the 
panel filled in (at 612, Figure 6C). 

(8) To create different coloring rules for other colmnns, repeat the 
above in additional control panels. If the user runs out of control 
panels, he can create more control panels by copying an existing 
one and pasting it onto a blank section of the color control sheet. 

To the extent that a single panel cannot hold all the requirements for a 
particular rule, a user can combine two or more panels to create a particular 
rule. For example, if a user needs ten (10) breakpoints, two panels can be 
used. 

With reference to the coloring rule is shown in Figure 6 A, once the 
rule has been set, a number of parameters are stored in the system. The 
parameters are "sheet name" ("DEMO 3" in Figure 6A), colunm 
specification, number of colors, array of breakpoints, array of colors, and 
multicolumn scaling mode. 
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The data coloring mechanism operates as follows, with reference to the 
flowchart of Figure 6C.: 

1 . The user enters the parameters into a rule panel on a "COLOR 
Control" worksheet 312, e.g., as described above with reference to the panel of 
Figure 6A. 

2. The user selects (clicks) the activation button (labeled "Click here 
to run these") on that panel (at 612). This causes the system to: 

(A) Read and interpret the parameters from the panel (at 614). 
The system can identify which button was clicked using the Visual Basic ("VB") 
"caller" property, described in more detail below. The parameters are then read 
based on the identity of the cell location of the button using the VB "TopLeftCell" 
property. The system retrieves the parameters (sheet name, column specification, 
number of colors, array of breakpoints, array of colors, and multicolumn scaling 
mode) from cells in this panel by relative reference to the button cell. 

(B) Next, determine the mode of the coloring rule (at 616) (i.e., 
numeric v. text or manual v. automatic, and, if automatic, which of value, log or 
count). This uses the analysis of the first breakpoint entry. 

(C) Compile a list of the columns specified in the "column 
specification" parameter (at 618). This is done by scanning the various areas 
contained in the selection, as follows: 

For Each singlearea In Selection. Areas 
For Each c In singlearea .Columns 
If Not CountEmpty Then 

Ir = LastRowInColumn (c . Column) 
End If 
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If Not Count Empty And Ir = 0 Then 
'skip this empty column 
ncols = ncols - 1 
Else * add this column to the list 

icol = icol + 1 
colnumarray (icol) = c. Column 
End If 
Next c 
Next singlearea 

(D) If an auto-breakpoint mode is being used (determined at 
620), analyze the data values to determine the breakpoints (at 622). This is done 
by: 

(i) Collecting statistics on the data distribution in each 
specified column; and 

(ii) Calculating the automatic break points for the 
appropriate mode. For example, the auto-value breakpoints 
are determined as follows: 



If breakmode = "VALUE" Then 

interval = (maxvalue - minvalue) / ncolors 

break (0) = minvalue 

For ibreak = 1 To ncolors - 1 

break (ibreak) = break (ibreak - 1) + interval 
Next ibreak 

(iii) Displaying the results for user approval or 

cancellation. 

(C) Loop through the cells in the chosen columns on the chosen 
worksheet (at 624). 
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(D) Compare each cell's value to the list of breakpoints (at 
626). If the coloring rule is in text mode, use the meanings of the special 
breakpoint operators ("contains", "blank", asterisk "*"; or quotation 
marks). 



(E) When a match is found, apply the appropriate color (at 



628). 



The code below illustrates the processes (D) and (E) for numeric 
breakpoints: 

10 

For Each cell In Range (Cells (Start ColoringRow, colnum) , 

Cells (Finished or ingRow, colnum) ) 
cvalue = cell. Value 
colored = False 
16 If IsNumeric (cell .Value) Then 

If Not IsEmpty (cell) Then 

' (have to test Jboth because EMPTY is numeric) 
For ibreak = 1 To ncolors - 1 

If cvalue <= break (ibreak) Then 
20 If Color (ibreak) <> "SKIP" 

cell . Interior . Colorlndex = 

Color ( ibreak) 
Call Text Contrast (cell) 
End If 

25 colored = True 

Exit For 
End If 
Next ibreak 
If colored = False Then 
30 ' Not -hit yet? Must be top category, so: 

If Color (ncolors) <> "SKIP" 

cell . Interior . Colorlndex = Color (ncolors) 
Call TextContrast (cell) 
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End If 



colored = True 



End 



If 



End 



If 



Else 



' not numeric - just don't color it 



10 
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End If 
Next cell 

The operation of the data coloring tool of this invention will now be 
described in greater detail. Each coloring rule is provided in a coloring control 
panel that has the general form of a coloring rule as shown in Figure 6A. In one 
preferred embodiment, each coloring control panel 144 is a double-outlined unit, 
sixteen (16) cells high by two (2) cells wide. As noted above, a user is provided 
with coloring control panels on the color control worksheet 312. A user can use 
the coloring control panel 144 to set the sheet and column(s) on which the rule is 
to operate, the number of colors, the various break points and the colors associated 
with those breakpoints. The sheet is set by entering its name into the cell 146 
adjacent the cell labeled "sheet". The column (or columns) on which the rule is to 
operate is (are) set by entering its (their) name in the cell 148 adjacent the cell 
labeled "column(s)". The number of colors is set by the user by selecting the cell 
150 adjacent to the cell labeled "# of colors". Each time the cell 150 is selected it 
increases the number of colors, up to a maximum of six (6), i.e., rotating through 
the values 1 to 6. I.e., when the cell 150 shows a "6" and is selected, it reverts 
back to "1". That is, selecting the cell 150 causes the value in the cell to cycle 
from "1" to "6" and then back to "1". 

A Visual Basic ("VB**) macro function C^CallColorColumn**) is associated 
with the top cell 152 of the control panel 144. When the cell 152 is selected by 
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the user (with the mouse 110 or the like), the function associated with that cell is 
executed by the computer (CPU 102). In the presently preferred embodiments, the 
CallColorColumn ftmction extracts the button name of the cell 152 and then calls 
a second function Q'CallColorColumnSubroutine^^) with that button name as one 
of the parameters. The function CallColorColumnSubroutine takes three 
parameters, namely ButtonName^ StartColoringRow, and FinishColoringRow. 
The two parameters StartColoringRow, and FinishColoringRow are optional. 

First, the function CallColorColumnSubroutine determines what specific 
values to use for the coloring by reading them from the control panel 144. Since 
the values are all in fixed positions relative to the selected button cell 152 that 
initiated the call to the function CallColorColumn^ the values can be determined 
once the location of that button cell 152 has been determined. This is done using 
the following Visual Basic code: 

Sheets { "Color Control" ) .Activate 
headingrow = 

ActiveSheet . Buttons (ButtonName) .TopLef tCell .Row 
headingcol = 

ActiveSheet . Buttons (ButtonName) . TopIieftCell . Column 

Note that if the function CallColorColumnSubroutine was called from 
another sheet (not "Color Control*') then this method will not find it. 
The various parameter values are then read as follows: 
Sheet name: 

datacol = headingcol + 1 

sheetname = Trim (Cells (headingrow + 1, datacol) .Value) 
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If there is no sheet named "sheetname" an error function is called. 
Generally, in preferred embodiments, a great deal of error checking takes place to 
ensure that the user is provided with a friendly and useable interface to the 
program. Most of the error checking is not mentioned in this description, 
however, one skilled in the art would know what kinds of error checking to 
implement in order to provide a user-friendly working environment. 
The column(s) to be colored are specified by: 

Cells (headingrow + 2, datacol) .Value 

The number of colors is specified by the variable ncolorSy where: 

ncolors = Cells (headingrow + 3, datacol) .Value 

Within the function CallColorColumnSubroutine there are two arrays, 
named break and color, which are used to store the breakpoints and colors, 
respectively. The first breakpoint is set as follows: 

break(l) = Cells (headingrow + 4, datacol) .Value 

The vedue of the first breakpoint is used to determine the break mode 
("NUMERIC", "VALUE", "LOG", or "COUNT"). \fbreak(l) (as determined 
above) is numeric, then the mode is set to "NUMERIC", otherwise, if break(J) is 
one of "VALUE", "LOG", or "COUNT", then the break mode is set to that mode, 
otherwise the break mode is set to "TEXT*. 
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10 



15 



20 



Next, the function determines whether multiple columns were specified, in 
which case it determines whether the user selected to re-scale all the columns. 
The user's re-scale selection is determined by: 

rescale_all_string = Cells (headingrow + 
15,datacol) .Value 

Now the rest of the breakpoints (if any) are read. If the break-mode is 
"AUTO" then the breakpoints are set as follows: 

For i = 2 To lastbreaknum 

break(i) = Cells (headingrow + 3 + i, datacol) .Value 

Various possible errors are checked for. E.g., if any breakpoints are 
missing (i.e., if break(I) is empty, the user is notified. Also, if the break mode is 
"NUMERIC" and non-numeric breakpoints are set, the user is notified. If 
numeric breakpoints are not in increasing order, the user is notified. As noted 
above, generally, in preferred embodiments of the present invention, a great deal 
of error checking is performed on all user inputs to ensure that the values are 
correct and consistent. Most error checking is omitted firom this description. 

The CallColorColumnSubroutine maintains an array, coiorname, which 
maps integers to colors, hi preferred embodiments, there are fifty six (56) colors 
available. To use the higher numbered colors, the computer's video card must be 
set appropriately. Using the coiorname array, the program next associates the user 
provided color names with index numbers. Specifically, for each of the (up to six 
in a preferred embodiment) colors specified, the user specifies an actual color 
name, denoted cname. This name is determined for each color by: 

For j = 1 To ncolors 



cname = Cells (headingrow + 8 + j, datacol) .Value 
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The interior of each color-specifying cell is then colored by the appropriate 
(selected) color by setting the color property {Interior.Colorlndex) of the cell: 

Cells (headingrow + 8 + j, datacol) . Interior . Colorlndex = 
Color (j ) , 

where the value of the variable j ranges from 1 to ncolors. 
Then the cell is further processed by a frinction TextContrast. 

Call TextContrast (Cells (headingrow + 8 + j, datacol)) 
With the parameters read and checked, the system is ready to process and 

color the selected sheet (specified at cell 146 in Figure 6A). The selected 

columns (specified in cell 148 in Figure 6A) in the selected sheet are processed 

one-by-one by the following program code: 

Call Parselnput (InString, inspecifier) 
For Each singlearea In Range (inspecifier) .Areas 
For Each c In singlearea. Columns 
colnum ^ c. Column 

Call ProcessOneColumn (colnum, ncolors, break. 

Color , breakmode , 
rescale_all, sheetname, 
StartColoringRow, 
Fini shColor ingRow ) 

Next c 
Next singlearea 

The processing performed by the function ProcessOneColumn is as 
follows: The colunm designated by colnum on sheet sheetname is to be colored 
according to the breakpoints in the array break and the colors in the array colors. 
The designated column is colored from the row corresponding to 
StartColoringRow to the row corresponding to FinishColor ingRow. Note that the 
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function ProcessOneColumn is also provided with the break mode and the 
variable rescaie_alL 

Function ProcessOneColumn first calculates the automatic breakpoints, if 
necessary. Note that automatic breakpoints are determined from the whole 
5 . column, even if this call says to color only a limited range of rows. If the value of 
breakmode is "VALUE" or "LOG" and the value of rescale_all is set to "True" Or 
the value of the first breakpoint (break(l)) is set to "VALUE" or "LOG", the 
program calls the function AutoValueBreakpoints as follows: 

Call AutoValueBreakpoints (colnum, colletter, ncolors, 
10 break r Color, breakmode, rescale_all) . 

Otherwise, if the breakmode is set to "COUNT" and the value of 
rescalejill is set to "True" or the first breakpoint {break(l)) is set to "COUNT", 
then the program calls the function AutoCountBreakpointSy as follows: 

15 Call AutoCountBreakpoints (colnum, colletter, ncolors, 

break. Color, breakmode, rescale_all, sheetname) . 

With the breakpoints calculated, the columns are colored according to the 
type of breakpoints specified by the user. Specifically, when the breakmode is any 
20 one of "VALUE", "COUNT", "LOG", or "NUMERIC", the system executes a 

function ColorNumericColumn. On the other hand, when the breakmode is 
"TEXT", the system executes a function ColorNumericColumn. The VB code for 
this is as follows: 

25 Select Case breakmode 

Case "VALUE", "COUNT", "LOG", "NUMERIC" 
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Call ColorNumericColumn { collet ter, ncolors, 
break. Color, StartColoringRow, FinishColoringRow) 
Case "TEXT" 

Call Col orTextColumn (collet ter, ncolors, break. 
Color, StartColoringRow, FinishColoringRow) 
End Select 

Then, when the rule in control panel 144 is selected for execution, the rule 
is applied to the selected colunin(s) (denoted in cell 148) of the named sheet (in 
cell 146). For each column in the named sheet, the value in each cell is compared 
to the various breakpoints and the cell is colored corresponding to the appropriate 
breakpoint. 

Examples of the application of various coloring rules in the "COLOR 
Control" worksheet of Figures 4A-4B, are shown with reference to the data in 
worksheet "DEMO 2" (depicted in Figures 9A, 9B, lOA and lOB). 

Color-Mediated Data Mining 

As noted above with reference to Figure 2, once the data have been 
colored according to the user-selected rules (at 126), the user can then perform 
color-mediated data mining (at 128). The presently preferred embodiment of this 
invention provides five mechanisms (each discussed below) for color-mediated 
data mining, namely mechanisms to: 

1 . enlarge/shrink cluster starts; 

2. vertically re-scale the display; 

3. score and sort clusters; and 

4. score and sort dose-response data. 
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The following discussion refers to the cluster control worksheet which is 
shown in Figures IIA-IIB. 



1. Enlarge/Shrink Cluster Starts 

The "Enlarge Cluster Starts" mechanism highlights the first row of each 
cluster in clustered data by enlarging the font of the cell containing the cluster 
number or label, thus enabling size reduction of the spreadsheet for the user to 
focus on the color patterns. When the cell height is dramatically reduced in order 
to see more cells on a screen or printed page, this enlargement allows the user to 
still read the label at the beginning of each cluster. The mechanism takes user 
input from a Cluster Control worksheet. A corresponding mechanism ("Shrink 
Cluster Starts") allows for undoing the enlarging. This mechanism handles 
cluster niunbers or textual labels. Any column can be designated as the cluster 
labels to be processed. 

Operation of the mechanism is as follows: 

(1) From the "Cluster Control" sheet 314 pick a control panel to 
use (one which is empty or one containing inputs no longer 
needed). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the blue section, enter a sheet name and the column to be 
considered as the cluster labels. 

(3) Click either the blue-text "Enlarge Cluster Starts" or "Shrink 
Cluster Starts" button. 
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The program code accomplishes this by scanning the column of cluster 
labels, identifying any entries that are different from the one immediately above, 
and enlarging them. Code that carries out this function is shown below: 



5 For Each c In Range (Cells (3 , colnum) , Cells ( lastrow, 

colnum) ) 

irow = c.Row - 1 
icol = c. Column 

If c. Value <> Cells(irow, icol) .Value Then 
10 c. Font. Size = bigfontsize 

' Rows (Xrow + 1) .RowHeight = blgrowhelght 

End If 
Next c 

Example 

15 An example of the application of the enlarge cluster mechanism is shown 

in Figure 12 which shows the application of a rule (shown in the control panel 
Figure IIC) from the cluster control worksheet in Figure IIB to the data of 
worksheet "DEMO 2" as shown after coloring in FIGURES lOA-lOB. As shown 
in Figure IIC, the rule is to be applied to column B of sheet "DEMO 2". 

20 

2. Vertical Display Re-Scaling 

The vertical re-scaling mechanism operates by taking a user-provided scale 
factor and then changing height of data rows to facilitate visualization of large- 
scale color pattems. The mechanism leaves column heading heights and column 
25 widths unchanged. This makes headings remain readable and greatly simplifies 

examining long columns of data for color pattems. 
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Figures 13A-13D show the application of the vertical display re-scale 
mechanism according to the present invention. Figures 13A-13B show some of 
the data in the worksheet labeled "DEMO 3*' 306 (Figure 13A shows the first 
thirty eight or so elements and Figure 13B shows the remaining elements of that 
worksheet). As can be seen from the figures, the worksheet "DEMO 3" 306 has 
three hundred and twenty eight (328) data entries (in rows 2-329). The user can 
vertically scale the display by selecting "Re-scale Vertical" from the system's 
special menu or by pressing a particular control key sequence (e.g., "CNTL- 
SHIFT-V" in a preferred embodiment). This presents the user with a dialog box 
318, as shovm in Figure 13C, which asks the user to enter a scaling factor 
relative to the current size. The user enters a scaling factor to enlarge or reduce or 
restore the display. In the example shown, the user enters a scaling factor of 0. 1 
which produces the vertically scaled display shown in Figure 13D. 

Vertical scaling allows a user to get an overview of the data, based on the 
coloring. 

The portion of the program code presented below carries out the central 
function of the vertical display rescaling mechanism: 

rowspec = "2:" & last row ' leaves the headings unchanged, 

i.e./ readabl e 

If factor = -1 Then 

Rows (rowspec) .Rows . AutoFit 

Else 

For irow = 2 To las trow 

Rows (i row) .RowHeight = 

Rows (irow) .RowHeight * factor 

Next irow 
End If 
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After execution of the rescaling mechanism, as can be seen in 
Figure 13D, the height of each row (except the heading rows) has been scaled by 
factor^ 0.1 in the example shown. In this manner, all rows of the data are made 
visible on a single page, thereby facilitating data analysis. 

3. Scoring and Sorting Clusters 

Scoring and sorting clusters assign nxunerical scores to the color patterns 
of individual rows or clusters of rows, thereby enabling comparison and sorting of 
the clusters by score. 

The scoring and sorting mechanism accepts user's designations of colors 
and corresponding relative scores. It handles cluster numbers or textual labels. 
Any column can be designated as the cluster labels to be processed. The 
mechanism scores a user-selected list of columns of data, with user-assigned 
relative weights, which need not be equal for all columns. 

User input is taken from a Cluster Control worksheet 314 (see 
Figures 11 A and IIB), which stores any number of parameter sets, each one 
with a user-specified name. 

The input data is automatically sorted by cluster label before starting, in 
order to group the clusters together in case the user has previously sorted the data 
by some other criterion. Then scores are normalized to remove the effects of 
cluster size, absolute magnitude of scoring points chosen, and absolute size of 
weights chosen. The results are written to two new worksheets without altering 
the original data sheet. The first derived sheet is for the numerical scores; the 
second is like the original, but has the clusters sorted into descending score order, 
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SO that the "best" are at the top, removing the need to visually scan a long colored 
worksheet. The derived output sheets have names that indicate their source data 
sheet and the name of the parameter set used for scoring. At the user's option, the 
system reversibly hides the im-scored columns in the cluster-sorted output sheet, 
focusing attention on the data that were used in scoring. 

In preferred embodiments, the system detects uncolored cells in the data 
and offers the user two progranmied modes of dealing with them, (vmcolored = 
entry on user's list of scores or uncolored = "average of other colors in row"), or 
the option of stopping to color them manually. 

If the user designates a column of individual compound labels as the 
"cluster labels," then the system compares single compounds rather than clusters. 

The mechanism operates as follows, with reference to FIGURES IIA-IIC. 

(1) On the "Cluster Control" sheet 314 the user picks a control panel (e.g., 
panel 1100) to use (a panel which is empty or one containing non- 
needed inputs). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the top black section 1102 of the selected control panel 1100, the 
user gives this new parameter set a name if not already done. The 
name will be used to label the outputs. 

(3) In the blue section, the user enters a sheet name (in 1104) and the 
colunm (in 1106) to be considered as the cluster labels. Note: To score 
each compound separately rather than in clusters, enter a column with 
individual compound labels as the "Cluster Col." 
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(4) The red section of the control panel is divided into two parts, with its 
action button 1108, with red text "Score and Sort Clusters", in the 
middle. Above the button, enter the names of the colors 1110 to be 
assigned point scores, along with their corresponding point scores 
1112. The scores are arbitrary and relative; they will be normalized by 
the system as necessary. However, a user should be sure always to 
assign higher point scores to colors which denote favorable values, and 
lower point scores to colors which denote unfavorable values. The 
cells with entries need not be colored, and need not be in score order, 
because the system will color and sort these cells when run. 

When assigning point values, a user should be aware that 
uncolored cells (which are most likely blank, i.e., unknown data) may 
have quality values above or below those that contain grouped and 
colored data.. The user may decide that some of the colored groups 
are "better" or "worse" than data being unknown, and can assign a 
score to the color "NONE" accordingly. 

(5) Below the "Score and Sort Cluster" button 1108, the user enters the 
columns 1114 to use for scoring (using the same syntax as for the Data 
Coloring) and their corresponding relative weights 1116. The numbers 
for weights are arbitrary and relative; they will be scaled by the system 
as necessary. Note that a line with multiple columns vsdll assign the 
entered weight to each of the columns. 

(6) The user the selects (clicks) the red-text "Score and Sort Clusters" 
button 1108. 



34 



wo 01/08039 PCT/USOO/20401 

(7) When the scoring and sorting tool runs (on the system 100), if the 
system detects uncolored cells in the data, the user will be offered two 
modes of dealing with them automatically, or a third manual option of 
stopping to color them. The two modes are: 

• "Use score for the color "none" on my list" 
(RECOMMENDED) 

• "imcolored = average of other colors in row". 

(8) The program then scans the chosen columns in each row and adds up 
the chosen column's color scores for that row. These scores are then 
averaged for each cluster of rows, as defined by the user-selected 
"cltister coltram." The VB program code which accomplishes this is as 
follows: 

For icol = 1 To ncols 
colorcode = 

Cells (irow,colnum( icol) ) . Interior .Colorlndex 
colorfound = False 
' Add up the weighted scores 
For j = 1 To ncolors 

If ( i color (j) = colorcode) Then 
j score = j 
colorfound = True 
Exit For 
End If 
Next j 

If colorcode = xlNone And treatblanks = "AVERAGED" 
Then colorfoxmd = True 

If Not color foimd Then 

Cells (irow, colnuin(icol) ) .Select 
If colorcode = xlNone Then 
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thiscname = "none" 

Else 

Call ColorNameToIndex (thiscname, colorcode. True) 
End If 

5 addscore = score (jscore) * colweight (icol) 

cmpdscore = cmpdscore + addscore 
' Next IF-THEN-ELSE block is 

' special calculations for the "averaged" mode 
If colorcode = xlNone Then 
10 lostweight = lostweight + colweight (icol) 

minscore = Application.Min (minscore, 0) 
maxscore = Application.Max (maxscore , 0) 

Else 

cmpdscore 2 = cmpdscore 2 + addscore 
16 End If 

Next icol 



( 9 ) The scores are then normalized for the various cluster sizes (number 
of rows per cluster), and scaled to a value of one hundred (1 00) for a 
20 row which is colored entirely with the user's highest-scoring color and 

a value of zero for a row that is colored entirely with any color to 
which the user has assigned a score of zero. 

If clusterscore (icluster) = 0 Then 

' do nothing 
Elself clusterscore (icluster) > 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / (nrows * maxscore) 
Elself clusterscore (icluster) < 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / 
(nrows * (-minscore)) 

End If 
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(lO)The results are presented as two newly inserted worksheets. The first 
is named by appending the word "SCORES" to the name of the 
original data sheet, and contains a list of the clusters with their sizes 
and scores. 

5 (1 l)The second new sheet is named by appending the word "SORTED" to 

the name of the original data sheet. The "SORTED" sheet contains a 
copy of all the original data and coloring, but with the rows re-ordered 
to place the highest-scoring clusters at the top, and all the clusters in 
descending score order from there down. 

10 (12)The user has two additional options regarding the appearance of the 

"SORTED" sheet: (a) a column containing the numerical scores can be 
added; and (b) the columns that were not used in the scoring can be 
hidden, so that only the ones actually used remain visible. 



15 An example of user provided data is shown in the control panel in 

Figure IID which is taken from the cluster control worksheet shown in 
Figure llA. As shown in Figure llA, the parameters are stored with the name 
"Cmpd" 1102. The scoring a sorting parameters in the control panel 1100 of 
Figure IID give the color red a score of "-1", orange has a score of "0", yellow 

20 has a score of "1" and light green has a score of "2". Columns C and D have 

relative weights of "1", as does column E. 

Note on the output of score and sort clusters: The system inserts two new 

sheets after the data (see, e.g.. Figures 14C-14D). The first added sheet contains 

two score columns: the scores generated by both of the auto modes (uncolored = 

25 zero and uncolored = average), but the one not selected will be gray. The scores 
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are on a scale of "-100" to "+100", where a score of "-100" means that all cells 
had the maximally negative score available, and a score of "+100" means that all 
cells had the maximally positive score available. The second added sheet has 
clusters sorted according to the one auto mode chosen when the tool ran. The 
routine offers to hide all columns that were not used in the scoring and sorting. 
The user can selectively unhide certain columns by using the "Edit:GoTo" menu 
option (or typing "CTRL-G'O, enter the columns in the "Reference" box (for 
example, C:F), then pick the "Format:Column:Unhide" menu option. 

If the user wants to see a color-score-sorted list of compounds within a 
particular cluster (such as the best cluster), the user should do the following: 

1 . Sort by clusters to find the ID of the cluster wanted. 

2. With a second rule, sort by compoimds. 

3. Go to the "SORTED by Compound" results sheet and turn on Excel' s 
"DataiFilter: AutoFilter" feature for the column that specified the 
clustering in the first sort. The user can then choose to view only the 
compounds in one particular cluster, and they will be in compound- 
sorted order. 

Example 

With reference to the already-colored worksheet "DEMO 1" shown in 
Figures 8A-8B, the cluster control worksheet shown in Figure 11 A, and the 
control panel shown in Figure IID, application of the scoring and sorting of 
clusters is described. As noted above, in the control panel of Figure IID, the 
parameters are stored with the name "Cmpd" 1102. The color red has a score of 
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"-1", orange has a score of "0", yellow has a score of "1" and lightgreen has a 
score of "2". Columns C and D have relative weights of"!", as does column E. 

Application of control panel "Cmpd" of Figure IID, by selecting "Score 
and Sort Clusters", produces the worksheets shown in Figures 14A-14B. When 
the user selects the "Score and Sort Clusters" button 1108 for the "Cmpd" control 
panel of Figure IID, the system first presents a dialog box (1402 shown in 
Figure 14 A) asking the user how un-colored cells should be scored for sorting. 
As noted above, un-colored cells can be scored explicitly by user entries 
(recommended) or as the average of the colors in the same row. Once the user 
makes a selection and clicks on the "OK" button, the system scores and sorts the 
data, producing the display screen shown in Figure 14B. The system provides a 
summary of what was done, including the information about the two new sheets 
("DEMO 1 SCORES by Cmpd" and "DEMO 1 SORTED by Cmpd un^ze") 
which are added to the workbook. Figures 14C-14D show the data in the newly 
created worksheet "DEMO 1 SCORES by Cmpd". 

Example 

With reference to the already-colored worksheet "DEMO 2" shown in 
Figures lOA-lOB, the cluster control worksheet shown in Figure 11 A, and the 
control panel shown in Figure IIC, application of the scoring and sorting of 
clusters is described. In the control panel of Figure IIC, the parameters are 
stored with the name "acids" (1102). The color red has a score of "0", orange has 
a score of "1", yellow has a score of "2" and light green has a score of "3". 
Colunm D has a relative weight of "1". 
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The application of the parameters or rules in the "acids" control panel 
produces two new worksheets ("DEMO 2 SORTED by acids" and "DEMO 2 
SCORES by acids") shown in Figures 15A-15B. 



4. Score and Sort Dose-Response Data. 

Data grouping and visualized by color coding has also been found to 
enable an automated solution to another vexing pattern recognition problem. An 
HTS lab is currently able to provide dose-response data on some subset of the 
whole collection of compoxmds originally tested. Sometimes, logistical 
constraints (time and/or cost) dictate that only a few concentration points can be 
run on each compound, and the high-throughput nature of the process generates 
somewhat noisy data. A similar situation sometimes exists in other biological 
laboratories where assays are very time-consuming. Dose-response curves with 
few, noisy points are difficult to analyze by traditional curve-fitting methods. The 
present invention includes a mechanisms/algorithms for analyzing percent-of- 
maximal-efTect data and accurately ordering the compounds by potency, even 
when faced with few points and high noise. 

The mechanism recognizes two properties of the dose-response data for 
each compound: 

1 . "Dose-responsiveness," the drop-off of activity with dilution, is taken 
as a sign that the compound has some reasonable pharmacological 
mechanism of action. 



40 



wo 01/08039 PCT/USUO/20401 

2. The activity measurements at the various concentrations also provide a 
confirmation of the general level of each compound's activity that was 
indicated by the original single-poke HTS hit. 

These two properties are somewhat independent, as illustrated by the 
example of a compound that is 95% active at all tested concentrations. It 
demonstrates very poor (i.e., no) dose-responsiveness over the range of 
concentrations tested, but is so active that it should not be ignored, because it 
might reveal a dose response if tested at even lower concentrations. 

By using the data groupings and color codes of the dose-dependent activity 
data columns, which help to smooth out excessively fine distinctions in the 
numbers, this invention includes an algorithm to assign numerical scores for dose- 
responsiveness and overall activity in the dose-response data. Moreover, the 
algorithm also calculates a smart composite of these two scores, in such a way that 
a highly active compound will get a high composite score even if its dose- 
responsiveness is poor. This composite score is capable of extracting useful 
information, even from very noisy data, and has been validated to correctly order a 
list of test compounds. The system of this invention adds data columns that report 
all three scores for each compound, and these columns can themselves be color 
coded, and thus used in further comparison to other types of data by compound or 
cluster scoring and sorting as described above. 

Moreover, within certain limits, the invention's dose-response scoring 
algorithm can also be used to make quantitative estimates of ICso values of 
compounds, even in the presence of large amounts of experimental error. This is 
accomplished by adding a set of hypothetical marker compounds with known 
potencies and theoretically calculated activities at the test concentrations. Since 
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the ordering algorithm is reliable, these markers will be ordered into their 
appropriate place, and can be used to calibrate the ordering scores in terms of 
actual ICso's. In other words, estimates of IC50 for the compounds can be 
generated by interpolating between the markers in the ordered list of composite 
scores. 

Scoring and sorting dose-response data according to the present invention 
processes several columns of colored dose-response data (activity vs. 
concentration) to assign three numerical scores that can later also be color coded, 
and thus used by the "Score and Sort Clusters" mechanism (described above) to 
compare compounds or clusters of compounds. The three scores are: 

(a) degree of dose-responsiveness over the concentration 
range tested; 

(b) overall activity level demonstrated in the dose-response 
data; and 

(c) a variably weighted composite of (a) and (b), designed 
to give high scores for high activity even when dose- 
responsiveness is poor (e.g., a compound that is highly 
active at all concentrations). 

The scoring and sorting dose-response data according to the present 
invention bases its scoring on colors rather than absolute activity numbers. The 
mechanism takes user input from a Cluster Control worksheet, e.g., as shown in 
Figures IIA-IIB. Figure IIB shows a control panel from the cluster control 
worksheet shown in Figure 11 A, wherein the user has selected colvunns F to I of 
worksheet "DEMO 1" for scoring dose-response. 
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The system detects uncolored data, notifies the user, and asks whether to 
contmue. If yes, the system skips the row containing the uncolored data. The 
system inserts three new columns on the original spreadsheet to contain the new 
scores, the new colimins immediately following the colimms of dose-response 
data. The column headings show the name of the parameter set used for scoring. 
Preferably, the system offers to regenerate existing table of Sheet Statistics to 
correct it for newly added score columns. Further, the system offers to sort the 
data rows by decreasing score. The system also offers to carry out quantitative 
estimates of ICso values for the user's compounds, by adding artificial calculated 
calibration marker compounds. 

In order to score and sort dose-response data: 

(1) Ensure that the dose-dependent activity data columns are ordered 
with highest concentration at the left and lowest concentration at 
the right. To ensure this, the system will remind the user of this 
requirement and ask him to confirm it when this tool is run. Note: 
If the data are for an undesired effect such as toxicity, the columns 
should be ordered the opposite way (lowest concentration left, 
highest right). 

(2) Use the Data Coloring (described above) to color the dose-response 
data columns. 

(3) Go to the Cluster Control sheet 314 (Figures IIA-IID) and pick 
a control panel 1100 to use. On this sheet, a single control panel 
extends vertically through the black, blue, red, and green sections, 
and provides input information for several tools. 
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(4) In the top black section, give this new parameter set a name (1102) 
if not already done. The name will be used to label the outputs. 

(5) In the blue section, enter a sheet name (1104). 

(6) In the red section (1110), enter the colors used to color the dose- 
dependent data, and relative point scores (1112) to be assigned to 
these colors. 

(7) In the green section (1118), enter the columns which contain the 
dose-response data (using the same syntax as for Data Coloring). 

(8) Click the green-text "Score Dose-Response" button (1120). 

(9) If the data are expressed as "percent of maximal effect," the user 
can follow the prompts to add calibration markers and make 
quantitative estimates of ICso's. 



Note on the output of score and sort dose-response: the system inserts 
three score columns after the dose-dependent data. The three scores are all scaled 
to a 0-100 range, and have meanings as follows: 

(a) degree of dose-responsiveness over the concentration range 
tested: 

100 = smoothly decreasing with dilution, spanning the 

entire range of color groups; 
75 = flat dose-response; and 
<75 = even more poorly behaved 

(b) overall activity level demonstrated in the dose-response data 

100 = highest activity color group at all concentrations. 
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(c) a variably weighted composite of (a) and (b), designed to give 
high scores for high activity even when dose-responsiveness is 
poor (e.g. a compound that is highly active at all 
concentrations). 



The Dose-Responsiveness Scoring Algorithm 

The data columns are ordered left to right, by decreasing concentration. 
The scoring algorithm awards positive score points for each dilution step across 
the data that actually shows a decrease in the activity data group (i.e., the color), 
and to penalize every step that does not. The algorithm uses the following 
scoring: 

• +1 point (awarded) when a dilution step moves to a lower activity group 

• 0 points when a dilution step leaves the activity group unchanged 

• -3 points (penalty) when a dilution step moves to a higher activity group 
The maximum score would go to a compound that shows all the possible 

color group steps in the right direction, and has no reversals. The minimum score 
would go to a compound with all the possible reversals, and no correct steps. The 
program then scales the extremes to 100 and 0, in order to present a consistent 
interface to the user. 

The relative magnitudes of the scoring parameters were empirically arrived 
at by testing "complete sets" of color patterns. This is possible because of the data 
simplification afforded by the value grouping. If we define the following 
numerical parameters: 

C = number of colors used, i.e., number of data value groups 

45 



wo 01/08039 PCT/USOO/20401 

P s number of points measured, i.e., number of different concentrations 
(doses) tested, 

then the entire ^Hmiverse" of possible color patterns includes (C^) different cases. 
For some typical values that might be encountered in real HTS data, this total 
number of cases is manageable in Excel, as shown by Table 1, below. 



Table 1. Total Number of Color Patterns 



p = 


C = 


total number 


# of cone. 


# of color groups 


of 


Points 




possible cases 


3 


3 


27 


3 


4 


64 


4 


3 


81 


4 


4 


256 


5 


3 


243 


5 


4 


1024 


6 


3 


729 


6 


4 


4096 


7 


3 


2187 


7 


4 


16384* 



* For a spreadsheet with a heading row, this exceeds ExCEL's current limit (for 
EXCEL 95) by one. This value should not exceed the limit for Excel 97. 

Scoring was done on several of these complete sets. In each set, the results 
were sorted by decreasing score and compared to "intuition" for general 
correctness of ordering of dose-responsiveness, and scanned for cases deemed to 
be clearly out of order. The (+1, -3) score set w£is foimd to produce satisfying 
ordering, while lesser penalties led to poorly ordered results. More objective tests 
of ordering (described below) were then used to further validate the algorithm 

The case of P=3 and C=3 is presented below in its entirety for illustration. 
Table 2 (Figure 16F) shows artificial data and processing for twenty seven (27) 
hypothetical compounds. The "percent inhibition" colimms represent assay 
"data." If one defines three groups by breakpoints at 33% and 66%, each cell is 
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assigned to a data group as shown in the middle set of three columns. Here it is 
clear that the order of compounds in this table is systematic (111,112, etc.)> to 
illustrate that the complete set is present. The third set of three columns shows 
color coding, with the darkest being least active and the lightest being most active. 

Then the data set was processed by the system to yield dose- 
responsiveness scores, and the results sorted by this score, giving Table 3 
(Figure 16G), the complete set in order of decreasing dose-responsiveness. 
Table 3 also shows the intermediate step-scoring and xmscaled score points, to 
aid in following and imderstanding the algorithm. These points are not displayed 
by the system itself. 



The Overall-Activity Scoring Algorithm 

The second property of interest to be extracted from the data is the overall 
activity level exhibited by each compound. As explained above, this is largely 
independent of the dose-responsiveness. 

The data value groups' ordinal index nimibers are used as single-point 
activity measures instead of the original data numbers. Extra weight is given to 
activity shown at lower concentrations by the simple algorithm of weighting each 
data column by its serial position, again ignoring the actual concentration values. 
The scores are then scaled to the range 0 to 100. The results of this scoring on the 
same complete set are shown in Table 4 (Figure 16H) which has been re-sorted 
by decreasing overall activity. 
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Comparison of Tables 3 and 4 (FiGUR£S 16G & 16H) shows clearly that 
the compound ordering by dose-responsiveness is quite different from the ordering 
by overall activity. The user (a chemist) could now color-code the new score 
columns and use them as independent factors in a larger scoring. However, 
chemists also want a single index of compound quality derived from the dose- 
dependent data. Moreover, a composite index would fiirther help to alleviate the 
effects of noise on data interpretation, by incorporating more information into the 
ordering process. This is an "information-based smoothing" of the data. 
Therefore, a procedure to calculate a third, "smart composite" score from the other 
two scores was devised. 

The general idea is that when selecting good compounds from dose- 
response data, compounds showing overall high activity should not be discarded 
for lack of responsiveness. Therefore, the smart composite score should give 
more weight to the overall activity when the overall activity is high, but lower 
weight when it is low. A generalized weighted average is written as 

composite score = (activity weight) (activity score) + 

(responsiveness weight) (responsiveness score) 
or, defining corresponding symbols: 

Sc = (Wa)(Sa) + (Wr)(Sr) 
If the weights are normalized to sum to unity, then this becomes 

Sc = (Wa)(Sa) + (1-Wa)(Sr) 

The activity weight Wa varies with the activity score Sa in such a way as 
to achieve the desired result. 
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The functional form of this variation was the subject of empirical testing. 
It was decided that the limits would be that Wa would approach 0.5 (activity and 
responsiveness equally weighted) in the limit of low activity, and that Wa would 
approach 1 .0 (responsiveness ignored) in the limit of high activity. The actual 
variation was encoded as an exponential increase in order to have rather sharp 
onset of the activity bias at higher activities: 

WA = (C,)exp [(k)(SA)] + C2 

The value of the coefficient k=0.06, for which the activity bias starts to 
become substantial around an activity score of eighty (80), was chosen for 
implementation in a preferred embodiment of this invention, according to 
empirical results. Figure 161 shows the variation for a few values of k. Table 5 
(Figure 16 J) shows all three scores for the example complete set, now sorted by 
decreasing composite score. 

The details of the scoring algorithms were arrived at largely by comparing 
results to intuitive ordering of the test cases in the complete sets. Because the sets 
were complete, no really new results can be generated by further test sets. 
However, one can generate test activity data sets from compounds of known 
potencies, whose real rank ordering is thus known, in order to see more 
objectively how well the scoring algorithms rank the results. 

To this end, a set of pseudo-ligands was hypothesized, with dissociation 

constants from a fictitious receptor ranging from nanomolar to millimolar (pK = 9 

to 3). The set included thirty one (31) compoimds, with potencies evenly spaced 

by 0.2 log units (9.0, 8.8, 8.6, ... , 3.4, 3.2, 3.0). A "pseudo-screen" was created 

which "tested" binding of these ligands at five concentration points in the usual 
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range: lO"^ M, 3x10^ M, 10"^ M, 3x10"^ M, and 10*^ M. Note that the span of 
potencies exceeds the span of concentrations tested by two log units on each end, 
so the test set includes both "very active" and "very inactive" compounds relative 
to the screening concentrations. 

Then artificial binding data were created by calculation as follows. 
Assuming a simple binding equilibrium of the ligand to a receptor, the "percent 
inhibition" at a given ligand concentration is equal to the fraction of receptor sites 
which are occupied by ligand, given by simple equilibrium equations as 

Pinhib = 100 • (ligand) / [K + (ligand)] 



For a more realistic simulation, artificial random noise was then added to 
the calculated numbers. The first experiment reported below used noise randomly 
distributed over the range of ±10 inhibition percentage points, and the second 
with noise up to +30 inhibition percentage points. Note that this means ten or 
thirty percentage points of absolute error, not 10% or 30% of the value. 

The artificial data were then color-coded according to the mechanisms of 

this invention (described above) into four color groups, using the simple 

breakpoints at 25, 50 and 75 percent inhibition. Note that in assigning these 

breakpoints, no consideration was given to the actual data values. Then the 

scoring algorithms of this invention were run, and the compounds sorted by the 

composite score. Rank order numbers were assigned to the compounds, with I 

being the most potent and 3 1 the least. In cases of ties in the composite score, 

equal rank niunbers were assigned, with a value equal to the average of the rank 

numbers speinned by the tied group of compounds (e.g., a tie for 2nd and 3rd 
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would result in each compound being ranked "2.5"). For each experiment, the 
final rankings were plotted against the "real" rankings by known potency, to test 
how well the scoring algorithms ordered the compounds. These plots are shown 
in Figure 16K (for noise=10) and Figure 16L (for noise=30). 

For the experiment with noise up to 10 inhibition percentage points, shown 
in Figure 16K, the ranking of the composite scores is "perfect" (in the sense of 
having no inversions) over the range of tested concentrations (pK = 5 to 7). The 
pseudo-screen is unable to distinguish the potencies of compounds above or below 
this range. 

When the noise is much higher (30 percentage points), the ranking of 
individual compounds is not as precise, but one can identify three cleanly divided 
"good-medium-bad" groups, as indicated by the dashed boxes on Figure 16L. 
Thus, even with this rather extreme noise level, the invention's scoring still 
successfiilly prioritizes the compoimds into groups. The range where 
discrimination is effective is still roughly the range of the test concentrations (pK 
= 5 to 7), but has been reduced somewhat by the higher noise. Note that the 
ranking within this range (the middle boxed group) is still mostly correct, with 
only one inversion, even for single compounds. 

Quantitative Estimation of Potencies 

With confidence established that the algorithms provide reliable rankings 
of compounds by potency, it is possible to proceed to making quantitative 
estimates. The method uses calibration marker compounds. 

To understand this method, it is helpful to realize that the concept is 
analogous to the quantitative use of SDS polyacrylamide electrophoresis gels to 
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measure protein molecular weights. The proteins are known to migrate through 
the gel with speeds directiy dependent on molecular weight, but it is difficult to 
calculate the absolute migration rates for a particular experiment. In dose- 
response scoring, the compounds are known to be properly ordered, but it is not 
clear how to calculate a potency (e.g., K^jjss or ICso) directly from the score. 

Protein chemists solve the molecular weight problem by running marker 
proteins, with know molecular weights, in the same gel, then using their band 
positions as calibration for the unknowns. Analogously, this invention's 
quantitative estimation method uses hypothetical marker compounds of know 
potency to internally calibrate the dose-response composite scores for the user's 
choice of a coloring rule, then interpolates the potencies of the imknowns. 

To create markers, the system asks the user to input the concentrations 
used for each of the dose-dependent activity data columns. The system then picks 
a set of calibration concentrations, at intervals of 0.5 log units, to span the tested 
range. For each of these calibration concentrations, a marker compound is created 
and added to the user*s compound list, and artificial data is calculated for each 
column, from the same simple equilibrium binding equation used above in the 
validation study (this time with no "noise"): 

Pinhib = 100 • (ligand) / [K + (ligand)] 

The marker data are then colored by the same rule used for the user's 

compounds, and the scoring and sorting algorithm is re-run. 

The result is that the markers are sorted into the list according to their 

potencies, and the potencies of the other compounds can be estimated by 
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interpolating between the markers, using the composite dose-response scores. To 
illustrate, a typical section of a sorted list is shown below in Table 6 
(Figure 16M), using four colors. 

Potencies for compounds that fall between two markers are calculated by 
linear interpolation between the logarithms of the markers. Given the various 
uncertainties in the data values themselves and in the evaluation process, it was 
found that linear interpolation between markers spaced at 0.5 log unit intervals 
was sufficiently precise, and no more complex curve fitting was necessary. 

Validation of Quantitative Estimation 

Validation of the quantitative estimation method followed a procedure 
very similar to that used to validate the scoring, and using the same sets of test 
data with various noise levels. As before, the testing concentrations were from 10" 
^ to 10'^ M (negative log from 5 to 7). Marker compounds (no noise) were added 
with pK*s from 4.5 to 7.5, and K^iss estimates for the noisy compounds were 
carried out by the interpolation method. The results are shown below for the cases 
of 10 and 30 inhibition percentage points of noise. 

Figures 16N and 16P show that the estimates are clearly quite good 
within the range of the testing concentrations (pK 5 to 7), but the quality of 
estimation deteriorates quickly beyond those limits, and algorithm does not 
reliably distinguish among compounds whose potencies are more than a half log 
unit beyond the testing range. Therefore, it was decided that presently preferred 
embodiments would not report any estimated values that fell outside the range of 
concentrations iised in the testing data columns. Thus, in the example in Table 6 
(Figure 16M), the lowest testing concentration was 10'^ M (= 0.1 ^iM). For the 
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first compound in Table 6, the system has estimated a potency with pICso > 7, 
but it conservatively only reports "<0.1 /iM." 



Table 7 summarizes the statistics of the estimations v^thin the testing 
limits. Table 7 shows that the method successfully estimates the potencies within 
about a factor of two, even with high noise levels. 

Table 7, Statistics of Estimation Validations 



percentage 


number of 


average of 


Average error 


points 


compounds 


abs(log error) 


in IC50 


of noise 






(factor) 


10 


13 


0.22 


X 1.6 


30 


13 


0.39 


x2.5 



Comparison to Other Methods of Quantitative Estimation 

Further corroboration was obtained by treating some real data from T-cell 
proliferation blockage assays. It is estimated that these data have at least as much 
noise as the artificial test set with 30 inhibition percentage points added. The 
standard treatment of this data in the past has been to fit a dose-response curve 
with a Hill coefficient of 1 , using a PC-based program ORIGIN. (ORIGIN is a 
data analysis program from Microcal Software, Inc. of Northampton, 
Massachusetts. ORIGIN is used in this instance for non-linear least-squares 
fitting of dose-response curves to fiinctional equations.) 

The data used here were from testing in the concentration range from 1 to 
0.03 ^ (negative log from 6 to 7.5). The plot in Figure 16Q shows the 
correlation of values estimated by this invention with values from ORIGIN fits. 
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Two compounds that the present invention estimates to be beyond the testing 
range, i.e., pICso below 6, are included as open diamonds, for illustrative purposes 
explained below. (As explained above, preferred embodiments of this invention 
would normally not report these values.) 

The results are consistent with the properties observed in the validation 
study. The present invention estimates are quite good within the testing range (6 
to 7.5). At the lower limit, this invention has made two estimations exactly at 10*^ 
M (arrows) which do not correlate as well with the ORIGIN fits. Nevertheless, 
because the whole plot spans only a relatively narrow range of potency, even these 
discrepancies are not very large. For all twenty eight (28) estimates within the 
testing range (including these two), the average logarithmic deviation between the 
two estimation methods is 0. 1 8, corresponding to a factor of only 1 .5. 

It is further noted that the calibration marker estimation method does not 
uniformly "flatten out" beyond the testing concentration range. The two open 
diamonds in FIGURE 16Q are estimations that presently preferred embodiments of 
this invention would not normally report because they have pICso < 6, but they 
agree well with ORIGIN fits 

It is interesting to compare calibration-marker with curve-fitting results for 
particularly badly behaved data, such as dose-response curves that are not 
monotonic with respect to concentration. This is sometimes the type of data that 
emerges from dose-dependent screening in a high-throughput mode. Studies of 
this type have been initiated by adding artificial noise to the extent of fifty (50) 
inhibition percentage points. 

Finally, it should be pointed out that there is some mechanical advantage 
to using the present invention relative to current practice of using ORIGIN. 
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ORIGIN is used by manually filling in a template with data, then manually 
executing a fit. Depending on the number of points and the degree of 
customization of parameters, this can take one to ten minutes of the user's time. 
The present invention, on the other hand, processes a whole spreadsheet at once 
(i.e., up to 16,383 compounds), and goes at a rate of about 3,000-4,000 
compounds per minute on a 200 MHz PC. 

Examples 

Figure 16A shows dose response data for twenty (20) compounds at four 
concentrations. The data have been grouped and the cells colored by the rule 
shown in Figure 16B. The result of the scoring and sorting process is shown in 
Figure 16C, where the compounds are ordered by decreasing values of the 
composite score (column H). Then, virtual "marker" compounds are added with 
known potencies spaced by 0.5 log units, and they are shown in Figure 16D, 
colored by the same rule and scored. The name of each marker compound 
designates the logarithm of its potency, e.g., "marker_7.0" has a potency IC50 = 
10'^ M. Figure 16E shows the result of sorting the list by decreasing composite 
score after adding the markers. This process then enables estimation of IC50 
values for the compoimds by interpolating in the column (H) of ordered composite 
scores, and these estimates appear in two forms in columns I and J. 

6. Summarize Spreadsheet Statistics Mechanism 

This mechanism creates a table summarizing the entries in each column of 

a data sheet, to aid the user in deciding how to color each column. The 
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mechanism counts numeric, text, and data entries, and uses color to flag coiiunns 
that have mixed types. The mechanism also coimts blanks, and specially flags 
columns with "trailing blanks," i.e., columns which are shorter than the longest 
one on the spreadsheet. For numeric data, the mechanism calculates minimum, 
maximum, mean, and standard deviation, even in the presence of interspersed text 
entries. For text data, the mechanism presents a list of the text strings used and 
their occurrence counts. The mechanism creates a summary key of the column 
letters and headings as a text box that can be copied to other sheets for convenient 
reference. 

Figure 17A shows a sample spreadsheet containing miscellaneous data 
on twenty four (24) compoimds. Figure 17B is the statistics sheet calculated 
from it. Each row of the statistics sheet describes one column of the original data 
sheet. First, the counts of numeric, text, date, and blank entries are listed, 
followed by two colimms describing the total length of the data sheet. Then the 
minimum, maximum, mean, and standard deviation of any nimieric data are 
reported. Finally, the statistics sheet lists a summary of the text strings foimd in 
each original data column. As examples, in Figure 17B, one can see that 
original column A ("Cmpd") had twenty four (24) different text strings, that the 
numeric data in original column C ("Testl") had a mean of 2.385, and, flagged by 
the red coloring, that original data column E ("Test3") had a mixture of ten 
numeric data and two text strings, both "N.A." 

The details of how the program code accomplishes this are 
straightforward, and one of ordinary skill in the art would know, from this 
description, including the Figures, how to make and use this invention. The 
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program loops through all the entries in the column, testing the data type of each, 
and tallying the counts and nimierical statistics. 



Spreadsheet Creation and Organization 

The operations of this invention require a considerable amount of user 
input, e.g., to create well-structured spreadsheets, to define and apply diverse 
coloring rules for large numbers of columns, and to use these colors and the user's 
stated scientific priorities to create meaningfully ordered lists of compounds or 
clusters. 

The user interface of this invention has been designed to ease this process 
and help the scientist focus on the tasks of formulating and recording clear 
descriptions of the evaluation parameters. Accordingly, this invention provides a 
number of tools and mechanisms to aid in the creation and organization of 
spreadsheets. These tools and mechanisms include: 

• Smart Append Colunm Mechanism 

• Merge Data Mechanism 

• Data Import Mechanism 

• Workbook Navigation Shortcuts 

• Conversion of "uM" to ^iM and "UU" to 

• Delete Pictures Mechanism 

• Change Values in Column Mechanism 

• Concatenate Values across Columns Mechanism 

• Delete Leading Inequality Signs Mechanism 

• Delete Derived Sheets Mechanism 
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Smart Append Column 

This mechanism appends new columns of data onto an existing 
spreadsheet, matching rows by labels (e.g., compound numbers). The mechanism 
copies all data to a new sheet before doing its work, leaving the original sheets 
unchanged. There is no need for the user to pre-sort any of the data. The 
mechanism provides optional case-sensitive or case-insensitive label matching. 

New rows are added at the bottom when new labels do not match any old 
labels. Rows with missing labels are identified and the system offers to fill them 
by copying previous label. Rows with repeated labels (i.e., replicate data) are also 
identified and the system offers a choice from among several automated 
processing rules, or manual fixing. A fast matching algorithm temporarily sorts 
rows by label, then restores original order when finished. Several intermediate 
stopping points are offered and extra data viewing options for conservative users 
worried about errors. 



Merge Data Mechanism 

The merge data mechanism copies new data values from an appended 
column into an older column. The mechanism copies all data to a new sheet 
before doing its work, leaving the original sheets unchanged. The mechanism 
detects cells where new data would overwrite old data that is different, flags them 
with color, and alerts the user. Several intermediate stopping points are offered to 



59 



wo 01/08039 PCT/USOO/20401 

the user, as are extra data viewing options are offered for conservative users 
worried about errors. 



Data Import 

One-button (or one-menu-click) import of existing Excel spreadsheets 
into an integrated file, which contains both the data and the related control sheets. 
The mechanism offers to search for and remove any leading or trailing spaces in 
the imported data and offers to consolidate replicate data rows into unique ones, 
using user choices as to how to handle the replicate data. The mechanism also 
detects hidden rows and offers to unhide them and detects formulas and offers to 
convert them to values. This mechanism is also used to update to newer version 
of the system. 

Workbook Navigation Shortcuts 

The system includes various workbook navigation shortcuts including: 

• A special added drop-down menu which includes commands for jumping 
directly to the various control sheets. These commands also have keyboard 
shortcuts assigned to them. 

• From a cell on a control sheet that contains the name of a data sheet, a 
special item on the right-click shortcut menu jumps directly to that data 
sheet. Other special items on this menu enable a "Twin Screen" display to 
see two sheets at once. 
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• To aid in choosing columns to enter on control sheets, there is a special 
"Twin Screen" display triggered by right-clicking any cell on a control 
sheet that contains the name of a data sheet. 

Convert "uM" to uM and "UU" to a 

Preferred embodiments of the system of this invention require the data 
spreadsheet to have one and only one row of column headings. The user can type 
either of the encoded strings '*uM" (lowercEise u, uppercase M) or "UU" (both 
uppercase) into any column heading, select the cell or whole row of headings, then 
pick this command. Each "uM" in the selection will be converted to "piM", and 
each "UU" will be converted to a "ji". The code recognizes the special exception 
of the word "VACUUM" as long as it doesn't end with the cases "uM." This 
conversion allows the user to avoid the confusing use of lowercase "u" or the 
column-widening use of the full prefix "micro." This utility appears on the 
system menu. 

Delete Pictures 

The system provides a mechanism for removing pictures containing 
chemical structures, in order to reduce file size, processing time, and confusion 
when they do not align properly after row sorting. 
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Change Values in Column 

This is a mechanism for regularizing data in a spreadsheet colunm. It 
facilitates replacement of all occurrences of a given value by another. The 
mechanism creates backup copies of the original column, and updates any existing 
data statistics for the edited sheet. 



Concatenate Values across Columns 

The system provides a mechanism for regularizing data in a spreadsheet 
colunm. Some possible uses include: (a) construction of unique row labels: M- 
number plus stroke number -> "Ml 23456/001"; and (b) reconstitution of 
numerical inequalities from separate columns: plus a number -> ">number". 
The user is provided with an option to include linking (delimiting) text strings 
between values and an option to include or skip blanks. The system retains the 
original columns and inserts a new one for the results. 

Delete Leading Inequality Signs 

Another mechanism for regularizing data in a spreadsheet column includes 
the mechanism to delete leading inequality signs. This mechanism converts 
entries like ">1000" to just the number "1000". This must be used with 
considerable caution, because it is the equivalent of creating a false test result. It 
is generally preferable to color the cells containing text strings with the data 
coloring mechanism described above, rather than alter them. All later processing 
is based on the colors, not the cell values. This mechanism also deletes inequality 
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sign only if it is the first character in the cell, 
copies of the original columns. 
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The mechanism creates backup 



Delete Derived Sheets 

The system menu includes a command to delete all output sheets from the 
current workbook, with separate user confirmation for each one. This is intended 
as a cleanup mechanism for information that may be outdated and is easily 
regenerated by subsequent system runs. 

Initial experience with the coloring tool has revealed that color coding has 
more subtle, but far-reaching usefulness. The colors themselves also can act as a 
"currency of exchange," a medium for comparing the quality of one kind of result 
to the quality of a very different kind of result. For example, an HTS activity of 
"95% inhibition" may be considered desirable and color coded, e.g., green. In the 
same list of compounds, a molecular weight between 400 and 600 may be consid- 
ered optimally desirable, and thus also color-coded green. If the user takes care 
when assigning colors, "green" takes on a common meaning across the board. 
This translation of data values into colors then opens up a cornucopia of 
possibilities for processing the colors (as numerical color indices) and comparing 
compoimds, searching, in our example, for the ones that are the "most green." 

Accordingly, the system includes tools to numerically score individual 
compoimds or clusters of compoimds by the colors that appear in their various 
data columns. The system can then create a new spreadsheet sorted by this score 



63 



wo 01/08039 PCT/USOO/20401 

(either by single compounds, or cluster-by-cluster, the choice being the user's), in 
which the "most green" compounds will then appear at the top. 



Examples 

Application to Portfolio Management 

The system can equally well be applied to any set of data where the rows 
are cases of a similar construct, with the columns being various properties of each 
case. For example, a data spreadsheet can contain a list of competing projects or 
investments for a company's portfolio, with the columns containing various 
managers' ratings of each project or investment. Figure 18A shows an example 
of twenty projects, each of which has been scored 1, 2, or 3 on two factors, one 
more important than the other, by each of three managers. The sheet has been 
colored by the rule shown in FIGURE 18B. Then the data were scored and sorted 
by the sorting rule of Figure 18C, and the result is shown in FIGURE 18D, 
Clearly, the projects that were given a "3" in the important factor come to the top, 
and it can be seen that the less important factor does indeed matter less to the final 
ordering. The colors also help to flag anomalies, such as a low score by one 
manager on an otherwise high-ranking project. 

In general, the data can be various sorts of data. Some examples are listed 
below and illustrated in the referenced Figures. 

Figure 19 shows a list of drug candidate compounds, scored and sorted 

by a composite of ten parameters that describe their physical, chemical, and 

biological properties. Green shades indicated desirable values; red shades are 
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undesirable. The display is compressed vertically with the vertical re-scaling tool 
to clearly display the difference in coloring patterns between the top eighty (80) 
compounds and the bottom eighty (80) compounds (separated in the illustration by 
a blank band). 

Figure 20 shows a list of proteins that are candidates for targets for drugs, 
chosen from a pool of candidate genes, scored and sorted by a composite of eleven 
parameters that describe their suitability. 

Figure 21 shows a list of research projects competing for resources. Each 
project has been scored according to several evaluation factors, and the whole 
array has been sorted by color groups. The same construct is useful for evaluating 
employee performance or job candidates. 

Figure 22 shows a list of pharmaceutical companies and their current 
statiis with regard to discovering or marketing products in each of various disease 
areas. Each company's line of products has been scored according to the matvirity 
of the offerings, and the whole array has been sorted by color groups. 

Figure 23 shows the use of data-grouping (coloring) rules to visualize the 
time courses of drug concentrations in blood. In this example, light colors were 
chosen to represent high concentrations of drug in the blood, while dark colors 
were chosen to represent low concentrations. The figure shows a wide range of 
differing time courses. 

Figure 24 shows the use of data*grouping (coloring) rules to visualize the 
matrix of pairwise cross-correlations of the results of eight (80) drug screens. In 
this example, light colors were chosen to represent low correlations, while dark 
colors represent high correlations. 
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Quantitation of the Similarity of Data Grouping in Two Variables 

As part of the present invention, a mechanism is provided for assigning a 
quantitative measure to the degree of similarity of grouping (visualized by color 
coding) of data in each of two columns of an Excel spreadsheet. The mechanism 
allows for a correlation-like analysis on a wide variety of data types, including 
text, or mixed numbers and text. 

In the data-exploration paradigm of the present invention, one of the first 
steps a user takes is to divide the range of data values in each colimm into a small 
number of groups for further analysis, thus effecting a reduction of precision 
which has been found to be useful in various ways. 

It is sometimes useful to explore whether the rows of the data matrix have 
been divided into similar groupings in each of two different colxmms. For 
example, a researcher might ask, "Do the high molecular weight compounds tend 
to be the ones whose solubilities fall below the limits of measurement?" In other 
words, this would mean to compare the groupings in the molecular weight colunm 
with the groupings in the solubility column. 

If the data were strictly quantitative, this would be called correlation of 
variables, and there exist a number of perfectly good statistical measures of the 
phenomenon. However, one of the unique capabilities of the present invention 
lies in dealing with textual data and mixtures of numbers and text, and it would be 
helpful if one could translate the visible color patterns of the present invention to 
some kind of quantitative measure of correlation. In order to avoid confusion with 
standard statistical correlation, the distinct term "color grouping similarity" is used 
to describe the new measure. 
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In preferred embodiments of this invention, the data grouping is stored in 
the form of the colored backgrounds of data cells. At first glance, one might 
simply seek to compare the colors of the first column with those of the second, 
and count the number of rows with matching colors. However, a color grouping 
similarity tool must be able to cope with the possibility that the colors are 
different. This could happen because the user chooses completely different color 
schemes for the two columns, or because the correlation is negative. As an 
example of negative correlation, suppose column A contains random numbers 
between 0 and 1, colored such that those above 0.5 are green and those below 0.5 
are red. Then imagine a colxmm B where each value is equal to one minus the 
corresponding value in column A, i.e., the "one's complement." If the user colors 
the second column with exactly the same coloring rule as the first, every row will 
have a different color in column B than in column A. None of the colors will 
match, though the groupings are perfectly correlated. To be successful, the tool 
must deliver a high measiu^e of correlation between such pairs of columns. The 
algorithm described below was designed to perform in this way. 

Algorithm for Measuring Data Grouping Similarity 
The algorithm was derived from semi-quantitative reasoning, as follows. 
It is based on the qualitative question, "For all rows that have one particular color 
in the first column, to what degree do they have a uniform color in the second 
column (not necessarily the same color as in the first column)?" The quantified 
answer to this question is then averaged over the set of colors used. 
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The details of the mechanism can be seen by example. First, to compare 
the grouping in two columns A and B, a matrix of "ordered color pair counts" 
(OCPC) is defined such that each matrix element OCPCy is the coimt of 
spreadsheet rows where one finds color / in spreadsheet column A and color j in 
spreadsheet column B. Then, the rows of the two spreadsheet columns are 
scanned to count the number of occurrences of each ordered color pair and thus to 
determine the values of the matrix elements. 

In the discussion below, carefully distinguish the rows and colunms of the 
user's spreadsheet of data from the rows and colunms of the derived OCPC 
matrix. 

If the two spreadsheet columns had exactly the same colorings the nonzero 
elements of the OCPC matrix would all be on the diagonal. As a simplified 
example, consider four colors (green, yellow, orange, red) and a total of 16 data 
rows. The diagonal matrix might be (zero elements left blank for emphasis) 



color in column A color in column B — > 


green 


yellow 


orange 


red 


green 


4 








yellow 




2 






orange 






7 




red 








3 



In the case above, there are groups of 4 spreadsheet rows colored green (in 
both columns), 2 rows colored yellow, 7 rows colored orange, and 3 rows colored 
red. 

In contrast, if the groupings were the same, but the coloring rule for the 
second spreadsheet column used the same colors in a different order, the OCPC 
matrix might look like the following, no longer diagonal: 
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color in column A color in column B 


green 


yellow 


orange 


red 


green 






4 




yellow 


2 








orange 








7 


red 




3 







A simple extension applies if entirely different colors (cyan, blue, maroon, 
purple) are used in the second spreadsheet column. The OCPC matrix might then 
be: 



color in column A color in column B 


cyan 


blue 


maroon 


purple 


green 






4 




yellow 


2 








orange 








7 


red 




3 







In any of the three cases above, the groupings are identical, and the OCPC 
matrices have the property that each matrix row and matrix colunm has only one 
nonzero element. That lone element is necessarily equal to the sum of the row or 
column. This situation should receive the highest similarity score. 

One way to define the contrasting situation that would deserves the lowest 
similarity score would be that for each user-defined group of spreadsheet rows in 
one spreadsheet coliunn, the other spreadsheet colimin has a "maximally non- 
unique^' set of colors. In the corresponding OCPC matrix, this corresponds to 
each matrix row or column having a broad distribution of values rather than a 
single non-zero, a uniform distribution has been chosen as the definition of this 
state: 
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color in column A color in column B — ► 


cyan 


blue 


maroon 


purple 


green 


1 


1 


1 


1 


yellow 


1 


1 


1 


1 


orange 


1 


1 


1 


1 


red 


1 


1 


1 


1 



This low-similarity state can be more precisely defined by saying that each 
element in a given matrix row or column is the average of all the counts in that 
matrix row or colunm. 

With these concepts defined and the OCPC matrix filled, the scores can 
then be derived. Each OCPC matrix row (corresponding to a color group in the 
first spreadsheet column) is selected in turn for scoring. Each element in the 
matrix row is given a score between zero and one, according to its linear 
interpolation between: on the one extreme, the average of the nonzero elements in 
the row, and on the other, the sxmi of the row or column (i.e., the maximum value 
it could have if all the others were zero). The scores are then averaged over all the 
rows of the OCPC matrix to generate a row-wise score component. 

Next, the corresponding process is applied to the columns of the OCPC 
matrix (each corresponding to a color group in the second spreadsheet column 
rather than the first). The resulting column-wise score component is averaged 
with the row-wise score component, then the average is scaled to a maximum of 
100 to generate the final similarity score for the two spreadsheet columns. 

Interpretation of the Similarity Scores 

Although the scores are quantitative and well-defined, their interpretation 
is best done in a partly subjective manner, based on experience. The behavior of 
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the scores is best understood by example. In Figure 25, the leftmost ("base") 
column has been compared to each of the others, and the scores are shown as well 
as pictures of the grouping patterns. Comparison of the base with the next column 
shows that the tool delivers a maximal score of 100 for identical grouping, even 
5 when the colors are completely different. Then, stepping across the figure toward 

' the right, it can be seen how the score decreases as the grouping pattem gradually 
becomes less similar to that of the base column. All the way down to a similarity 
score of 40, it is still basically true that the light colors are on top and the dark on 
the bottom, with increasing "noise," but when the score falls to 20, the pattem 
10 appears to have no correspondence to that of the base. 



Implementation of the Data Grouping Similarity Tool 

In practice, the tool allows the user to choose two sets of spreadsheet data 
16 columns. The program then automatically generates all pairs containing a column 

from the first set with a column from the second set, then writes the similarity 
scores onto a newly inserted spreadsheet in the user's workbook. The output takes 
the form of a table where the degree of similarity is itself color-coded to aid the 
user in identifying significant cases. An example appears Figure 26. 



While the invention has been described vsdth reference to particular 
mechanisms (algorithms, processes and functions) and architectures, one skilled in 
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the art would realize that other mechanisms and/or architectures could be used 
while still achieving the invention. 

While embodiments of the present invention have been described with 
particular setup and initialization procedures, other setup and/or initialization 
5 procedures can be used. 

Further, while many of the operations have been shown as being 
performed in a particular order, one skilled in the art would realize that other 
orders, including some parallelization of operations, are possible and are 
considered to be within the scope of the invention. 
10 While the present invention has been described with reference to analysis 

and pattern recognition in data sets relating to chemical compounds, the methods, 
systems and devices of this invention are considered to be general constructs 
covering other, non-chemical data sets. 

Thus, are provided methods, systems and devices for analysis and pattern 
15 recognition in large, multidimensional data sets using low-resolution data 

grouping. One skilled in the art will appreciate that the present invention can be 
practiced by other than the described embodiments, which are presented for 
purposes of illustration and not limitation, and the present invention is limited 
only by the claims that follow. 
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What is claimed is: 

1 . A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable nimiber of groups; and 

applying at least one of the grouping rules to the data. 

2. A method as in claim 1 wherein the data are provided in a table and 
wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

3. A method as in claim 1 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

4. A method as in claim 1 further comprising: 

presenting the grouped data in a manner that visually distinguishes the 

groups. 

5. A method as in claim 4 wherein the grouping rules associate colors 
with groups and wherein the presenting of the grouped data further comprises: 

coloring an aspect of the data according to the rules. 
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6. A method as in claim 4, wherein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled column of the data based 
on the at least one breakpoint and the corresponding color of the at least one 
breakpoint. 

7. A method as in any one of claims 3 and 6, wherein the breakpoints 
are selected from: (a) numeric values; and (b) textual values. 

8. A method as in claim 3 wherein the at least one breakpoint is 
determined automatically based on the data. 

9. A method as in claim 5 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgrounds of table cells according to the rules. 

10. A method as in claim 1 wherein the mmiber of groups is fewer than 
a number of possible data values, 

11. A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable number of groups; 
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applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 

1 2. A method of operating on data, the method comprising: 
generating grouped data by applying to the data at least one user-defined 

grouping rule for grouping the data into a user-definable number of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defined scoring rule for scoring the grouped data according to user-defined scores. 

13. A method according to claim 11 or 12 wherein the data comprises a 
number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the function to the data to obtain a score for each case. 

14. A method according to claim 1 3, further comprising: 
sorting the scored cases by score. 

15. A method according to claim 14, wherein the scored cases are 
sorted individually. 
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16. A method according to claim 14, wherein the scored cases are 
sorted by cluster. 

1 7. A system for operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; and 

a mechanism constructed and adapted to apply at least one of the grouping 
rules to the data. 

18. A system as in claim 1 7 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

19. A system as in claim 1 7, wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable mmiber of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

20. A system as in claim 17, further comprising: 

a mechanism constructed and adapted to present the grouped data in a 
manner that visually distinguishes the groups. 
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21 . A system as in claim 20, wherein the grouping rules associate 
colors with groups and wherein the mechanism constructed and adapted to present 
the grouped data further comprises: 

a mechanism constructed and adapted to color an aspect of the data 
according to the rules. 

22. A system as in claim 20, wherein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the mechanism constructed and adapted to present the grouped data comprises: 

a mechanism constructed and adapted to color each data item in the at least 
one labeled column of the data based on the at least one breakpoint and the 
corresponding color of the at least one breakpoint. 

23 . A system as in any one of claims 1 9 and 22, wherein the 
breakpoints are selected from: (a) numeric values; and (b) textual values. 

24. A system as in claim 19 further comprising: 

a mechanism constructed and adapted to determine at least one breakpoint 
automatically, based on the data. 

25. A system as in claim 21 wherein the data are provided in a table, 
wherein the mechanism constructed and adapted to color an aspect of the data 
comprises: 
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a mechanism constructed and adapted to color backgrounds of table cells 
according to the rules. 



26. A system as in claim 17 wherein the number of groups is fewer 
than a number of possible data values. 

27. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; 

a mechanism constructed and adapted to apply at least one of the grouping 
rules to the data to generate grouped data; 

a mechanism constructed and adapted to provide at least one user-defined 
scoring rule for scoring the grouped data according to user-defined scores; and 

a mechanism constructed and adapted to apply at least one of the scoring 
rules to the grouped data to score the grouped data. 

28. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to generate grouped data by 
applying to the data at least one user-defined grouping rule for grouping the data 
into a user-definable number of groups; and 

a mechanism constructed and adapted to score the grouped data by 
applying to the grouped data at least one user-defined scoring rule for scoring the 
grouped data according to user-defined scores. 
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29. A system according to claim 27 or 28 wherein the data comprises a 
number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
mechanism constructed and adapted to score of the grouped data comprises: 

a mechanism constructed and adapted to apply the function to the data to 
obtain a score for each case. 

30. A system according to claim 29, further comprising: 

a mechanism constructed and adapted to sort the scored cases by score. 

31 . A system according to claim 30, wherein the scored cases are 
sorted individually. 

32. A system according to claim 30, wherein the scored cases are 
sorted by cluster. 

33 . A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable nimiber of groups; and 

applying at least one of the grouping rules to the data. 
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34. A medium as in claim 33 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

35. A medium as in claim 33 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

36. A medium as in claim 33, wherein the method further comprises: 
presenting the grouped data in a manner that visually distinguishes the 

groups. 

37. A medium as in claim 36 wherein the grouping rules associate 
colors with groups and wherein the presenting of the grouped data further 
comprises: 

coloring an aspect of the data according to the rules. 

38. A medium as in claim 36, wherein the data are in labeled columns 
in a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled column of the data based 
on the at least one breakpoint euid the corresponding color of the at least one 
breakpoint. 
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39. A medium as in any one of claims 35 and 38, wherein the 
breakpoints are selected from: (a) numeric values; and (b) textual values. 

40. A medium as in claim 35 wherein the at least one breakpoint is 
determined automatically based on the data. 

41 . A medium as in claim 37 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgrounds of table cells according to the rules. 

42. A medium as in claim 33 wherein the number of groups is fewer 
than a number of possible data values. 

43. A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable number of groups; 

applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 
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44. A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

generating grouped data by applying to the data at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defmed scoring rule for scoring the grouped data according to user-defined scores. 

45. A medium according to claim 43 or 44, wherein the data comprises 
a number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the function to the data to obtain a score for each case. 

46. A medium according to claim 44, the method further comprising: 
sorting the scored cases by score. 

47. A medium according to claim 46, wherein the scored cases are 
sorted individually. 

48. A medium according to claim 46, wherein the scored cases are 
sorted by cluster. 
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breakpoints corresponding to the user-definable number of groups, and application of the at least one rule to the data divides the data 
into groups based on the breakpoints. The grouped data is presented in a manner that visually distinguishes the groups, sometimes 
by coloring an a^ct of the data according to the rules. 
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Analysis and Pattekn Recognition in Large, Multidimensional 
Data Sets Using Low-Resolution Data Grouping 

A portion of the disclosure of this patent document contains material 
which is subject to copyright protection. The copyright owner has no objection to 
the facsimile reproduction by anyone of the patent document or the patent 
disclosure, as it appears in the Patent and Trademark OflBce patent file or records, 
but otherwise reserves all copyright rights whatsoever. 

1. Field of the Invention 

This invention relates to analysis and pattern recognition of data. More 
particularly, this invention relates to methods, systems and devices and 
combinations thereof for analysis and pattern recognition in large sets of 
multidimensional data using low-resolution data grouping. 

2. Baci^round 

With the advent of computerization and the low cost of data storage and 
acquisition, people in many endeavors are now accumulating very large sets of 
data. For example, scientists in drug and chemical companies now use automation 
to perform so-called high-throughput screening ("HTS") of chemical compounds. 
HTS uses automated, relatively low-cost techniques to obtain various items of 
information about chemical compounds. The goal of using HTS is to obtain 
information about a very large number of compounds in a quick and relatively 
low-cost manner. Having accumulated a very large HTS data set, it is necessary 
to evaluate the data in order to deteraiine which, if any, of the analyzed 
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compounds warrants further investigation. However, the results of such HTS tend 
to be very large sets of multidimensional data, on the order of thousands of rows 
and dozens of columns, and so it is very difScult to make decisions just by 
looking at the data. 

5 In addition to the very large amoimts of data produced by HTS, difficulties 

in existing data handling and analysis methods include the following: 

• Data comes from very diverse sources, including HTS laboratories, 
physical measurements, bio-scientists' laboratories, various computational 

10 software programs, etc., and the different sources tend to have very diverse 

kinds of output including numbers, text, mixed data types, error notations, 
blank data, replicate data (more than one value per compound), etc. 

• Not all sources produce data on the same list of compoimds, or in the same 
order. 

15 • Some data values are misleadingly too precise, i.e., have high relative 



experimental errors or noise, and can easily be over-interpreted. 



Medicinal chemists have to weigh very different kinds of factors (for 



example, molecidar weight vs. dose-responsiveness vs. ClogP vs. 



secondary biology vs. selectivity across screens) in trying to determine 



20 



which are the best compounds or clusters of compoimds to which to 



devote further work. 
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Summary of the Invention 

This invention solves the above and other problems by providing 
automated tools to help with and speed up these data handling and analysis 
processes. These tools embody some assimiptions about how the data should be 
treated by internalizing the most generally acceptable assumptions, but leaving 
more idiosyncratic decisions to individual users. 

A central concept on which this invention is based is grouping data into a 
relatively small number of categories using low-resolution data grouping. The 
grouping is visualized by assigning colors to data groups, e.g., in spreadsheets. 
Grouping of data potentially changes the precision of the data. 

This categorization of data has several major benefits, including: 

• creating a visual means of finding data patterns; 

• beneficially blurring small variations in numerical data that are, in 
practice, excessively fine distinctions, possibly due to experimental 
error; and 

• providing, in the colors themselves, a means or "common currency" to 
evaluate candidates across a wide range of data types. 

Accordingly, in one aspect, this invention provides mechanisms to 
expedite pattern recognition in large sets of multidimensional data, such as those 
that chemists assemble when evaluating hits from high-throughput screening 
(HTS) and deciding which ones will get priority for further investigation. In 
controlled trials, this invention has reduced the time to evaluate real data sets, 
firom days of intense himian effort, which is vulnerable to errors due to volimie or 
fatigue, to a few minutes of automation with graphical presentation of results. 
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It quickly becomes obvious upon using the system that the tools also have 
value in data-handling areas other than HTS. Examples include selection and 
management of any kind of tabulated data, e.g., portfolio management for any 
kind of rated portfolios, selection of drug candidate compounds, selection and 
management of proteins that are candidates for targets for drugs, selection and 
management of research projects competing for resources, and evaluating 
employee performance or job candidates. 

The system of this invention includes a new special command menu, a set 
of graphical user interface worksheets, and action buttons to facilitate the coloring 
and color analysis processes for the user. While the central process is the data 
grouping and coloring, there are also new tools for the upstream, or pre-grouping 
and coloring processes of importing, assembling, regularizing, and characterizing 
data in a spreadsheet, and for the downstream processes of visualizing, scoring, 
comparing, and sorting large amounts of color-coded data. The data-grouping and 
spreadsheet-coloring tool is presently implemented with a flexible, powerful, and 
convenient user interface that does not require knowledge of spreadsheet macros 
or of the Visual Basic language (used for the system's implementation). 

Accordingly, this invention provides methods, systems and devices for 
operating on data. 

In one aspect, the method of this invention provides at least one user- 
defined grouping rule for grouping the data into a user-definable number of 
groups. At least one of the grouping rules is applied to the data. The data may be 
provided in a table and the grouping rule applies to at least one user-selectable 
column of the table. In some embodiments, the grouping rule defines breakpoints 
corresponding to the user-definable number of groups. Application of the rule the 
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data divides the data into groups based on the breakpoints. The method may 
include presenting the grouped data in a maimer that visually distinguishes the 
groups. In some embodiments, the grouping rules associate colors with groups 
and the grouped data is presented with an aspect of the data colored according to 
5 the rules. 

Sometimes the data are in labeled columns in a spreadsheet, and the 
grouping rule specifies at least one breakpoint and a corresponding color for each 
range defined by the breakpoint. The grouped data are presented by coloring each 
data item in one labeled colimm of the data based on the breakpoint and the 
1 0 corresponding color of the breakpoint 

The breakpoints may be numeric or textual values. In some embodiments, 
the breakpoint is determined automatically based on the data. 

Sometimes the data are provided in a table, and backgrounds of table cells 
are colored according to the rules. 
15 The number of groups may be fewer than a number of possible data values. 

In another aspect, this invention is a method of operating on data by 
providing at legist one user-defined grouping rule for grouping the data into a user- 
definable number of groups. The grouping rule is applied to the data to generate 
grouped data. At least one user-defined scoring rule is used to score grouped data 
20 according to user-defined scores. The scoring rule is applied to the grouped data 

to score the grouped data. 

In yet another aspect, this invention is a method of operating on data, in 
which data are grouped by applying to the data at least one user-defined grouping 
rule for grouping the data into a user-definable number of groups. The grouped 
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data are scored by applying to the grouped data at least one user-defined scoring 
rule for scoring the grouped data according to user-defined scores. 

In some embodiments the data can be a number of parameters for each of a 
number of cases and the scoring rule comprises a scoring function of user- 
selectable parameters and user-defined weights for the selected parameters to be 
used in scoring the cases. The scoring applies the fimction to the data to obtain a 
score for each case. Sometimes the method includes sorting the scored cases by 
score, individually or by cluster, as described below. 

The notion of clxxstering is that subsets of the various cases may be 
associated into clusters by having identical entries in any user-selected column of 
data, known as a clustering colunm. In some embodiments of the invention, the 
integrated clusters are treated by averaging the properties of all the cases which 
comprise each cluster. 

Thus, according to aspects of this invention, in order to facilitate analysis 
and pattern recognition in large, multidimensional data sets, the precision of the 
data is potentially changed (implemented, e.g., by grouping the data) and then the 
data are presented for visualization (implemented, e.g., by coloring the data). 

Brief Description of the Drawings 

This file contains at least one drawing executed in color. Copies of this 
patent with color drawings will be provided by the United States Patent and 
Trademark Office upon request and payment of the necessary fee. 

The above and other objects and advantages of the invention will be 
apparent upon consideration of the following detailed description, taken in 
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conjunction with the accompanying drawings, in which the reference characters 
refer to like parts throughout and in which: 

Figure 1 shows a typical computer system on which the present invention 
operates; 

Figure 2 shows an overview of the functionality of the present invention; 
Figures 3A-3B depict a display of data in a spreadsheet; 
Figures 4 A-4B show a color control rules worksheet according to one 
embodiment of the present invention; 

Figures 5A-5B show data coloring rules; 

Figures 6A-6C show a data coloring control panel and a flow chart of the 
data coloring process, respectively; 

Figures 7A-8B show the worksheet of Figure 3 A and 3B after various 
coloring rules in Figure 4A have been applied; 

Figures 9A, 9B, lOA, and lOB depict displays of data in spreadsheets; 

Figures llA and IIB show the form of the cluster control worksheet 
according to one embodiment of the present invention; 

Figures IIC-IID shows control panels from the cluster control 
worksheet of Figures IIA-IIB; 

Figure 12 shows the enlarguig of the cluster starts mechanism according 
to one embodiment of the present invention; 

Figures 13A-13D show the application of vertical display re-scaling 
according to one embodiment of the present invention; 

Figures 14A-14D and 15A-15B show the application of the scoring and 
sorting of clusters according to one embodiment of the present invention; 
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Figures 16A-16N, 16P and 16Q show aspects of the application of the 
dose-response scoring and estimation of potencies according to one embodiment 
of the present invention; 

Figures 17A-17B show the application of the sheet statistics tool 
according to one embodiment of the present invention; 

Figures 18A-18D show the application of the scoring and sorting of 
clusters for the purpose of project prioritization and management according to one 
embodiment of the present invention; 

Figures 19-24 show examples of the application of this invention to 
various types of data; and 

Figures 25 and 26 show application of an aspect of this invention. 

Detaiusd Description of the Presently Preferred Exemplary 

Embodiments 

Overview 

Figure 1 shows a typical computer system 100 on which the present 
invention operates. The computer system 100 includes a processor (CPU) 102 
connected to a memory system 104 and a display 106. The computer system also 
includes various input devices including a keyboard 108 and a mouse 110 or other 
pointing device. Internal storage 112 (e.g., a hard disk, a CD ROM and the like) 
and external storage 114 (such as a floppy disk, CD ROM and the like) are also 
provided. 

Various aspects of this invention are implemented as computer software 
programs or algorithms 116 which run on the computer system 100. The software 
programs 116 can reside in the internal storage 112, the external storage 114, 

8 
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and/or in the memory 104. The software programs 116 operate on data 118 which 
is provided, e.g., on the external storage 114. The software programs 116 operate 
in a standard and known manner by being executed on the processor 102 of the 
computer system 100. 

In some embodiments of the present invention, the user can create and 
modify various executable rules 120 which can operate on the data 118. For the 
sake only of explanation, the rules 120 are depicted separately fi*om the data in the 
figures. However, as explained in more detail below, some or all of the rules 120 
can be part of the data 118. 

In preferred embodiments, the computer system 100 is capable of running 
the spreadsheet program EXCEL™ 95 (hereinafter "Excel") from Microsoft 
Corporation, and the software computer programs 116 are written in Microsoft 
Corporation's Visual Basic (hereinafter ''VB") and are provided as an add-in to 
Excel. A single copy of software thus serves all data files on a particular 
machine. To conserve EXCEL resources, in some embodiments, the package self- 
installs the add-in when the user opens a data file, and un-installs the add-in when 
the last data file in memory is closed. 

In a preferred embodiment, this invention works entirely within the 
environment of Excel. Excel structures data files as workbook files which 
contain worksheets. The programs 116 of this invention consist of special Excel 
worksheets, caUed control sheets, on which input data is written by the user into 
designated labeled cells. The control sheets are part of the same Excel workbook 
file as the data. The control sheets also contain action buttons to execute the 
various procedures associated with this invention. The rules 120 are formed by 
setting various parameters in the control sheets. 
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When the workbook file is saved, the parameters (for the rules 120) are 
stored on the control sheets along with the data, and they can be modified and/or 
re-executed at any time without having to re-enter anything. The results of 
operations are automatically written as worksheets in the same workbook file, 
providing a convenient, integrated data environment in a single file. 

The system according to the present invention operates, in one aspect, in 
accordance with Figure 2. Recall that the user's aim is to perform analysis and 
pattern recognition in large, multidimensional data sets using (potentially low 
resolution) data grouping. To this end, the user and/or the system will create rules 
for coloring and presenting the data. First (at 122) a user creates and organizes the 
data 118. Various tools (discussed below) are provided to aid in the creation and 
organization of the data. Then (at 124) the user creates rules 120 for operating on 
the data 118. The rules 120 can be created before or after the data 118, rules can 
be reused for different sets of data and multiple rules can apply to the same data. 
The creation and operation of rules are discussed in greater detail below. Once the 
data 118 and the rules 120 are created, the user then selects some (or all) of the 
rules to apply to the data (at 126). Specifically, the user groups and thereby colors 
the data according to selected rules. With the data grouped and colored according 
to the rules, the user can then perform group/color-mediated data mining (at 128). 

Figures 3A-3B show views of the program of this invention in operation 
with a sample EXCEL sheet 300, denoted "DEMO 1" (302) containing data (not all 
the data in the sheet is visible). The views of Excel worksheets shown in the 
various figures and examples that follow are the views that are presented on the 
display 106 of the computer system 100. Sheets in an EXCEL workbook are 

labeled with tabs at the bottom of tiie worksheet. The data on the "DEMOl" sheet 
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300 consists of eight columns of data for each of a number of compounds. The 
compounds are denoted "Cmpdxx", where "joc" ranges jfrom "01" to the number of 
compounds. In FIGURE 3B, the last compound visible on the data sheet is 
"Cmpd58". The eight colunms are headed: 

1. "Cmpd" (column A); 

2. "Series" (column B); 

3. "Testl" (column C); 

4. 'Test2" (column D); 

5. "Tests" (column E); 

6. "HTS SPA Dose-Resp % Inhib @3xlO-6]Vr' (column F); 

7. "HTS SPA Dose-Resp % Inhib @ IxlO-6M" (column G); 

8. "HTS SPA Dose-Resp % hihib @ 3xlO-7M" (column H); and 

9. "HTS SPA Dose-Resp % Inhib @ Ixl0-7M" (column I). 

In addition to the "DEMO 1" worksheet 300, the Excel workbook shown 
in Figures 3A and 3B has seven other worksheets, denoted "DEMO 2" 304; 
"DEMO 3" 306; "clusterinfo DEMO" 308; "Append Control" 310; "Color 
Control" 312 and "Cluster Control" 314. The last three worksheets, denoted 
respectively "Append Control"; "Color Control" and "Cluster Control," contain 
various rules and controls (to be discussed below). The data in worksheets 
denoted "DEMO 1" 302; "DEMO 2" 304; "DEMO 3" 306; and "clusterinfo 
DEMO" 308 correspond to data 118 (FIGURE 1) and the controls or rules in the 
worksheets denoted "Append Control" 310; "Color Control" 312, and "Cluster 
Control" correspond to the rules 120 (FIGURE 1). 

Figures 4A-4B show a color control rules worksheet (312, denoted "color 

control") according to the present invention, as displayed on display 106 of the 
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computer system 100. The color control worksheet 312 is shown with some rules 

already in place, i.e., having values set, and other rules left blank. These rules are 

shown as examples only, and, as with any of the other types of rules, any or all of 

the rules can be set by the user. A typical data coloring rule 130 is shown in 

Figure 5A. The rule 130 has already been set up and operates on the appropriate 

data when selected by a user (using mouse 110, FIGURE 1 or some other pointing 

device) in the area 132 marked "Click here to run these". The rule 130 (as with all 

of the preferred color control rules) has foiu: parts, namely the name of the sheet 

134 containing the data on which the rule is to operate CT)EMO 1" in the example 

of Figure 5A); the colunms 136 of data of the sheet on which the rule is to 

operate ('TE" in the example of Figure 5A); the nimiber of colors 138 to be used 

by the rule 130; a number of breakpoints 140 (denoted "break 1" to "break 4" in 

the example of Figure 5A); and a corresponding number of colors 142 for each 

range defined by the breakpoints (denoted "color 1" to color 4" in the example of 

Figure 5A). Actually, as explained below, the number of breakpoints is one less 

than the nimiber of colors.. In the specific example shown in Figure 5A, the rule 

has three breakpoints of 1, 5 and 10, defining four ranges with four corresponding 

colors 142, namely light green, yellow, orange and red. Preferably the named 

colors are also depicted in the actual colors, so that, in this example, the 

background of the word "Ughtgreen" is shown in light green, the background of 

the word "yellow" is shown in yellow and so on. 

In this invention it is preferable to show data and meta-data (headings etc.) 

in color. In some embodiments, the coloring is implemented by showing a 

background area of the text representing the data in the appropriate color. 

Sometimes the actual text representing the data is shown in the appropriate color. 

12 
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In presently preferred embodiments, the font color is only changed in cases where 
necessary to improve contrast with the background color for readability. Only two 
font colors, dark (black) and light (pale gray), are used in the presently preferred 
embodiment. Combinations of both approaches can be used. For example, the 
5 background section of the word ^'yellow" is preferably shown in the color yellow. 

It is also possible to show the word itself, i.e., the font, in the color yellow, as long 
as that color is distinguishable from the background. 

The particular rule 130 shown in FIGURES 4A and 5A, operates as follows, 
when selected: 

10 In sheet "DEMO 1 " 302, in column E, values less than or equal to 1 (break 

1) are colored light green (color 1); values in the range 1 to 5 (between break 1 
and break 2) are colored yellow (color 2); values in the range 5 to 10 (break 2 to 
break 3) are colored orange (color 3); and values greater than 10 (break 3) are 
colored red (color 4). 

15 Another typical data coloring rule 130-1 from the color control sheet 312 

is shown in FIGURE 5B. The rule 130-1 is set up to operate on columns "C" and 
"D" of sheet "DEMO 1'*. The rule 130-1 uses three (3) breakpoints (breakl=0.1, 
break2=l and break3=5) defining four ranges with four (4) corresponding colors 
("lightgreen'% "yellow", "orange", and "red"). 

20 The results of applying the rule 130-1 of FIGURE 5B to the data in sheet 

"DEMO 1" (302, Figure 3) are shown in Figures 7A-7B. As can be seen from 

Figures 7A-7B, after application of the rule 130-1, all of the data in columns C 

and D of the sheet labeled "DEMO 1" has been colored according to the rule. 

Specifically, data having a value less than or equal to break 1 (0.1) have been 

25 colored light green; data values in the range between break 1 and break 2 (0.1 to 

13 
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1) have been colored yellow; data values in Hie range between break 2 and break 3 
(1 to 5) have been colored orange; and data values greater than break 3 (5) have 
been colored red. 

The results of applying all of the other color control rules shown in 
Figures 4A-4B to the data m sheet "DEMO 1" are shown in Figures 8A-8B. 
The rules can be applied individually (as shown above with respect to 
Figures 7A-7B), or they can be all be applied at the same time. In order to apply 
all rules to a particular data set (sheet), each rule can be individually selected or 
the area labeled "Re-RUN ALL RULES FOR SHEET NAMED DEMO 1" (on the right 
side of Figure 4A) can be selected. Note that if two rules apply to the same 
column of the same sheet, the second mle run on that column will override the 
first rule nm on that column. 

To create a coloring rule a user performs the following (with reference to 
Figure 6B): 

(1) Select the "Color Control" sheet 312 and pick a control panel on 
that sheet to use (an empty panel or one containing a rule no longer 
needed) (at 600). All control panels on a sheet can be cleared by 
clicking the button labeled "Clear all entries on this sheet" 
(318 in Figure 4A). 

(2) In the selected control panel, enter the name of the sheet to be 
colored (at 602). 

(3) In the selected control panel, enter a column or columns (at 604). 

For multiple colimms, either list them separated by commas, or use 

a colon or hyphen to denote ranges, or some combination. For 

example, "A:D J^" means columns A,B,C,D, and F. To aid in 
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choosing columns, the user can right-click on the cell containing 
the name of the data sheet, and pick "Open Twin Screen" from the 
shortcut menu that appears, to create a i^ecial dual display. This 
also creates a "Close Twin Screen" button to go back. 

(4) Choose a number of colors to use (at 606), either by entering the 
number of colors or by repeatedly clicking the gray button adjacent 
the cell labeled "# of colors". In preferred embodiments, the 
system allows for five brealqjoints and six colors per rule. 
Accordingly, the numbers will cycle from 1 to 6, and various cells 
below them will be blacked out accordingly. 

(5) Enter the breakpoints that define the color groups (at 60S), in any 
of three modes: 

a) Numeric data, manual mode : enter numbers to form 
the brea]q)oints, i.e., the bomdaries between the 
color groups, one less than the number of colors, in 
increasing numerical order. Cells whose values 
exactly equal a breakpoint value will be colored 
with the lower group (brealq^oint 1 is colored with 
color 1, etc.) 

b) Numeric data, automatic mode : enter either 'Value", 
"log", or "count" as the first breakpoint. If multiple 
columns have been chosen, the user must also enter 
"yes" or "no" opposite "Re-scale all?" at the bottom 
of the panel, to indicate whether each column 
should get its own auto-breakpoints, or whether the 
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auto-breakpoints of the first column (first in list in 
the rule, not first on the data sheet) should be used 
for all. 

This mode reports information about the 
5 brealqwints it determines, and thus could also be 

used to explore the distribution of numerical values 
in a column prior to a final manual breakpoint 
selection. 

c) Text data : enter the strings to be matched and 
10 colored, in preferred embodiments, up to five (5) in 

number. Matching is case-insensitive unless the 
string is enclosed in double quotes (" and '*); 
otherwise, no quotation marks are necessary. 
Several special text strings act as operators if entered as 
1 5 the first word in a rule cell: 



Rule Entry 

(OPERATORS need not 

be uppercase — ^here only 
for emphasis) 


Meanin2 


test string 


color data cell if its whole content matches the test string 


NOT test string 


• color data cell if its whole content does not match the 
test string; 

• will not color numeric cells 


CONTAINS test string 


color data cell if contains the test string as a substring 
anywhere 


NOTCONTAJNS test 
string 


• color data cell if it does not contain the test string 
anywhere; 

• will not color numeric cells 
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Rule Entry 




^vjjr nivA. 1 ijKo necu nox 
be uppercase — ^here only 
for emphasis) 


Meaning 


BEGINS test string 


color data cell if it begins with the test string 


ENDS test string 


color data cell if it ends with the test string 


* (an asterisk) 


(wildcard) color data cell containing any data, including 
numeric cells 


BLANK 


color data cells that are blank 



Using quotes to force matching to be case-sensitive also 

works with strings that follow an operator. 

It is possible to construct a text-coloring rule in 
which certain cells may satisfy more than one of the 
*T)reakpoint" values. For example, if a rule says that 
"active" is colored green and "contains act" is 
colored red, then the word "active" in a cell would 
satisfy both. In such cases, the system colors the 
cell according to the first condition satisfied on the 
list of breakpoints. This dependence on the 

order can be used advantageously to achieve 
complex coloring conditions. The sequence of 
conditions can be considered as a series of filters, 
through which only the as-yet-uncolored cells fall 
through to the next decision. 
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(6) Enter the names of the colors to use (at 610), in the order 
corresponding to the breakpoints. A display of color samples is 
provided at the right side of the Color Control sheet 312. A user 
need only enter the name, and the appropriate cell will become 
colored when the tool is executed. If the user wants the color to 

■ display immediately, he can copy and paste the sample cell into the 
rule's color cell. A special pseudo-color named "SKIP" is used to 
tell the system not to color the cells whose data falls in this group. 

(7) When the rule has been created, the user executes the rule by 
selecting the rule's "Click here to run these" button on the 
panel filled in (at 612, Figure 6C). 

(8) To create different coloring rules for other columns, repeat the 
above in additional control panels. If the user runs out of control 
panels, he can create more control panels by copying an existing 
one and pasting it onto a blank section of the color control sheet. 

To the extent that a single panel cannot hold all the requirements for a 
particular rule, a user can combine two or more panels to create a particular 
rule. For example, if a user needs ten (10) breakpoints, two panels can be 
used. 

With reference to the coloring rule is shown in Figure 6A, once the 
rule has been set, a number of parameters are stored in the system. The 
parameters are "sheet name" QDEMO 3" in Figure 6A), column 
specification, number of colors, array of breakpoints, array of colors, and 
multicolumn scaling mode. 



18 



wo 01/008039 PCT/USOO/20401 
The data coloring mechanism operates as follows, with reference to the 
flowchart of FiGUBE 6C.: 

1 . The user enters the parameters into a rule panel on a "COLOR 
Control" worksheet 312, e.g., as described above with reference to the panel of 
Figure 6A. 

2. The user selects (clicks) the activation button (labeled "Click here 
to run these") on that panel (at 612). This causes the system to: 

(A) Read and interpret the parameters from the panel (at 614). 
The system can identify which button was clicked using the Visual Basic ("VB") 
"caller" property, described in more detail below. The parameters are then read 
based on the identity of the cell location of the button using the VB "TopLeftCell" 
property. The system retrieves the parameters (sheet name, column specification, 
number of colors, array of breakpoints, array of colors, and multicolunm scaling 
mode) from cells in this panel by relative reference to the button cell. 

(B) Next, determine the mode of the coloring rule (at 616) (i.e., 
nimieric v. text or manual v. automatic, and, if automatic, which of value, log or 
count). This uses the analysis of the first breakpoint entry. 

(C) Compile a list of the columns specified in the "column 
specification" parameter (at 618). This is done by scanning the various areas 
contained in the selection, as follows: 

For Each singlearea In Select ion. Areas 
For Each, c In singlearea. Columns 
If Not Count Empty Then 

Ir = Last RowInColumn (c .Column) 
End If 
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If Not Count Empty And Ir = 0 Then 
'skip this empty column 
ncols = ncols - 1 
Else ' add this column to the list 

icol = icol + 1 
colnumarray (icol) = c. Column 
End If 
Next c 
Next singlearea 

(D) If an auto-breakpoint mode is being used (determined at 
620), analyze the data values to determine the breakpoints (at 622). This is done 
by: 

(i) Collecting statistics on the data distribution in each 
specified column; and 

(ii) Calculating the automatic break points for the 
appropriate mode. For example, the auto-value breakpoints 
are determined as follows: 



If breakmode = "VALUE" Then 

interval = (maxvalue - minvalue) / ncolors 

break (0) = minvalue 

For ibreak = 1 To ncolors - 1 

break (ibreak) = bresLk (ibreak - 1) + interval 
Next ibreak 



(iii) Displaying the results for user approval or 

cancellation. 

(C) Loop through the cells in the chosen columns on the chosen 
worksheet (at 624). 
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(D) Compare each cell's value to the list of breakpoints (at 
626). If the coloring rule is in text mode^ use the meanings of the special 
breakpoint operators ("contains", "blank", asterisk "*"; or quotation 
marks). 



10 



(E) When a match is found, apply the appropriate color (at 



628). 



The code below illustrates the processes (D) and (E) for numeric 
breakpoints: 



For Each cell In Range {Cells (StartColoringRow, colnviin) , 

Cells (FinishColoringRow, colnum) ) 
cvalue = cell. Value 
colored := False 
15 If IsNumeric (cell .Value) Then 

If Not IsEmpty (cell) Then 

' (have to test both because EMPTY Is numeric) 
For ibreak = 1 To ncolors - 1 

If cvalue <= break (ibreak) Then 
20 If Color (ibreak) <> "SKIP" 

cell . Interior . Colorlndex = 

Color (ibreak) 
Call Text Contrast (cell) 
End If 

25 colored = True 

Exit For 
End If 
Next ibreak 
If colored = False Then 
30 ' Not hit yet? Must be top category, so: 

If Color (ncolors) <> "SKIP" 

cell . Interior . Colorlndex = Color (ncolors) 
Call TextContrast (cell) 
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End If 

colored = True 
End If 
End If 

Else ' not numeric - just don't color it 
End If 
Next cell 

The operation of the data coloring tool of this invention will now be 
described in greater detail. Each coloring rule is provided in a coloring control 
panel that has the general form of a coloring rule as shown in FIGURE 6A. In one 
preferred embodiment, each coloring control panel 144 is a double-outlined xmit, 
sixteen (16) cells high by two (2) cells wide. As noted above, a user is provided 
with coloring control panels on the color control worksheet 312. A user can use 
the coloring control panel 144 to set the sheet and column(s) on which the rule is 
to operate, the number of colors, the various break points and the colors associated 
with those breakpoints. The sheet is set by entering its name into the cell 146 
adjacent the cell labeled "sheet". The column (or columns) on which the rule is to 
operate is (are) set by entering its (tiheir) name in the cell 148 adjacent the cell 
labeled "column(s)'*. The number of colors is set by the user by selecting the cell 
150 adjacent to the cell labeled "# of colors". Each time the cell 150 is selected it 
increases the number of colors, up to a maximum of six (6), i.e., rotating through 
the values 1 to 6. I.e., when the cell 150 shows a "6" and is selected, it reverts 
back to "1". That is, selecting the cell 150 causes the value in the cell to cycle 
from "1" to "6" and then back to "1". 

A Visual Basic ("VB") macro function CCailCoiorColumjf') is associated 
with the top cell 152 of the control panel 144. When the cell 152 is selected by 
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the user (with the mouse 110 or the like), the function associated with that cell is 
executed by the computer (CPU 102). In the presently preferred embodiments, the 
CallColorColumn function extracts the button name of the cell 152 and then calls 
a second function Q'CallColorColumnSubroutim'') with that button name as one 
of the parameters. The function CallColorColumnSubroutine takes three 
parameters, namely ButtonName, StartColoringRow, and FinishColoringRow. 
The two parameters StartColoringRow^ and FinishColoringRow are optional. 

First, the function CallColorColumnSubroutine determines what specific 
values to use for the coloring by reading them from the control panel 144. Since 
the values are all in fixed positions relative to the selected button cell 152 that 
initiated the call to the function CallColorColumn, the values can be determined 
once the location of that button cell 152 heis been determined. This is done using 
the following Visual Basic code: 

Sheets ( "Color Control") .Activate 
headingrow = 

ActiveSheet . Buttons (ButtonName) . TopLef tCell . Row 
headingcol = 

ActiveSheet .Buttons (ButtonName) . TopLef tCell . Colunm 

Note that if the function CallColorColtmnSubroutine was called from 
another sheet (not "Color Control") then this method will not find it. 
The various parameter values are then read as follows: 
Sheet name: 

datacol = headingcol + 1 

sheetname = Trim (Gel Is (headingrow + 1, datacol) .Value) 
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If there is no sheet named "sheetname" an error function is called. 
Generally, in preferred embodiments, a great deal of error checking takes place to 
ensure that the user is provided with a friendly and useable interface to the 
program. Most of the error checking is not mentioned in this description, 
however, one skilled in the art would know what kinds of error checking to 
implement in order to provide a user-friendly working environment. 
The column(s) to be colored are specified by: 

Cells (headingrow + 2, datacol) .Value 

The number of colors is specified by the variable ncolors^ where: 

ncolors = Cells {headingrow + 3, datacol) -Value 

Within the fimction CallColorColumnSubroutine there are two arrays, 
named break and color ^ which are used to store the breakpoints and colors, 
respectively. The first breakpoint is set as follows: 

break(l) = Cells (headingrow + 4, datacol) .Value 

The value of the first breakpoint is used to determine the break mode 
('^NUMERIC", '*VALUE", "LOG", or "COUNT"), ^breakfl) (as determined 
above) is numeric, then the mode is set to 'T>njMERIC", otherwise, if break(l) is 
one of "VALUE", "LOG", or "COUNT", then the break mode is set to that mode, 
otherwise the break mode is set to "TEXT". 
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Next, the function detemiines whether multiple columns were specified, in 
which case it detemiines whether the user selected to re-scale all the columns. 
The user's re-scale selection is determined by: 

res cale_all_st ring = Cells (headingrow + 
15,datacol) .Value 

Now the rest of the breakpoints (if any) are read. If the break-mode is 
"AUTO" then the breakpoints are set as follows: 

For i = 2 To lastbreaknum 

break(i) = Cells (headingrow + 3 + i, datacol) .Value 

Various possible errors are checked for. E.g., if any breakpoints are 
missing (i.e., if break(I) is einpty, the user is notified. Also, if the break mode is 
"NUMERIC" and non-numeric breakpoints are set, the iiser is notified. If 
numeric breakpoints are not in increasing order, the user is notified. As noted 
above, generally, in preferred embodiments of the present invention, a great deal 
of error checking is performed on all user inputs to ensure that the values are 
correct and consistent. Most error checking is omitted fi'om this description. 

The CallColorColumnSubroutine maintains an array, colomame, which 
maps integers to colors. In preferred embodiments, there are fifty six (56) colors 
available. To use the higher numbered colors, the computer's video card must be 
set appropriately. Using the colorname array, the program next associates the user 
provided color names with index numbers. Specifically, for each of the (up to six 
in a preferred embodiment) colors specified, the user specifies an actual color 
name, denoted cname. This name is determined for each color by: 

For j = 1 To ncolors 

cname = Cells (headingrow + 8 + j, datacol) .Value 
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The interior of each color-specifying cell is then colored by the appropriate 
(selected) color by setting the color property (Interior.Colorlndex) of the cell: 

Cells (headingrow + 8 + j, datacol) . Interior . Col or Index = 
Color (j) , 

where the value of the variable j ranges from 1 to ncolors. 
Then the cell is further processed by a function TextContrasL 

Call TextContrast (Cells (headingrow + 8 + j, datacol)) 
With the parameters read and checked, the system is ready to process and 

color the selected sheet (specified at cell 146 in Figure 6A). The selected 

columns (specified in cell 148 in Figure 6A) in the selected sheet are processed 

one-by-one by the following program code: 

Call Parselnput (InString, inspecifier) 
For Each singlearea In Range (inspecifier) .Areas 
For Each c In singlearea. Columns 
colnum = G . Colunin 

Call ProcessOneColumn (colnum, ncolors, break. 

Color, brealanode, 
rescale_all, sheetname, 
StartColoringRow, 
FinishColoringRow) 

Next c 
Next singlearea 

The processing performed by the function ProcessOneColumn is as 
follows: The column designated by colnum on sheet sheetname is to be colored 
according to the breakpoints in the array break and the colors in the array colors. 
The designated column is colored fi:om the row corresponding to 
StartColoringRow to the row corresponding to FinishColoringRow. Note that the 
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function ProcessOneColumn is also provided with the break mode and the 
variable rescalejilL 

Function ProcessOneColumn first calculates the automatic breakpoints, if 
necessary. Note that automatic breakpoints are detemiined from the whole 
, column, even if this call says to color only a limited range of rows. Ifthe value of 
breakmode is "VALUE" or "LOG" and the value of rescale_all is set to "True" Or 
the value of the first breakpoint (break(l)) is set to "VALUE" or "LOG", the 
program calls the fyoicHonAutoValueBrealkpoints as follows: 

Call AutoValueBreakpoints (colnum, colletter, ncolors, 
break. Color, breakmode, rescale_all) . 

Otherwise, if the breakmode is set to "COUNT" and the value of 
rescalejzll is set to "True" or the first breakpoint (break(IJ) is set to "COUNT", 
then the program calls the fvaiction AutoCotmtBrea^oints^ as follows: 

Call AutoCount Breakpoints (colnum, colletter, ncolors, 
break. Color, breakmode, rescale_all, sheetname) . 

With the breakpoints calculated, the columns are colored according to the 
type of breakpoints specified by the user. Specifically, when the breakmode is any 
one of "VALUE", "COUNT", "LOG", or "NUMERIC", the system executes a 
function ColorNumericCoIitmn, On the other hand, when the breakmode is 
"TEXP*, the system executes a function ColorNumericColumn. The VB code for 
this is as follows: 

Select Case breakmode 

Case "VALUE", "COUNT", "LOG", "NUMERIC" 
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Call ColorNumericColuran (col letter , ncolors, 
break. Color, StartColoringRow, FinishColoringRow) 
Case "TEXT" 

Call ColorText Column (colletter, ncolors, break. 
Color, StartColoringRow, FinishColoringRow) 
End Select 

Then, when the rule in control panel 144 is selected for execution, the rule 
is applied to the selected column(s) (denoted in cell 148) of the named sheet (in 
cell 146). For each column in the named sheet, the value in each cell is compared 
to the various breakpoints and the cell is colored corresponding to the appropriate 
breakpoint. 

Examples of the application of various coloring rules in the "COLOR 
Control" worksheet of Figures 4A-4B, are shown with reference to the data in 
worksheet "DEMO 2" (depicted in Figures 9A, 9B, lOA and lOB). 

Color-Mediated Data Mining 

As noted above with reference to Figure 2, once the data have been 
colored according to the user-selected rules (at 126), the user can then perform 
color-mediated data mining (at 128). The presently preferred embodiment of this 
invention provides five mechanisms (each discussed below) for color-mediated 
data mining, namely mechanisms to: 

1 . enlarge/shrink cluster starts; 

2. vertically re-scale the display; 

3. score and sort clusters; and 

4. score and sort dose-response data. 
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The following discussion refers to the cluster control worksheet which is 
shown in Figures IIA-IIB. 



1. Enlarge/Shrink Cluster Starts 

The "Enlarge Cluster Starts" mechanism highlights the first row of each 
cluster in clustered data by enlarging the font of the cell containing the cluster 
number or label, thus enabling size reduction of the spreadsheet for the user to 
focus on the color pattems. When the cell height is dramatically reduced in order 
to see more cells on a screen or printed page, this enlargement allows the user to 
still read the label at the beginning of each cluster. The mechanism takes user 
input firom a Cluster Control worksheet. A corresponding mechanism ("Shrink 
Clusher Starts") allows for undoing the enlarging. This mechanism handles 
cluster numbers or textual labels. Any column can be designated as the cluster 
labels to be processed. 

Operation of the mechanism is as follows: 

(1) From the "CLUSTER Control" sheet 314 pick a control panel to 
use (one which is empty or one containing inputs no longer 
needed). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the blue section, enter a sheet name and the column to be 
considered as the cluster labels. 

(3) Click either the blue-text "Enlarge Cluster Starts" or "Shrink 
Cluster Starts" button. 
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The program code accomplishes this by scaiming the column of cluster 
labels, identifying any entries that are different fix>m the one immediately above, 
and enlarging them. Code that carries out this function is shown below: 

For Each c In Range (Cells (3 , colnum) , Cells (lastrow, 
colnum) ) 

irow = c.Row - 1 

icol = c. Column 

If c. Value <> Cells (irow, icol) .Value Then 

c. Pont. Size = bigfontsize 
' Rows (Irow + 1) ,RowHBight = Jbigrowhelgiht 

End If 
Next c 

Example 

An example of the application of the enlarge cluster mechanism is shown 
in Figure 12 which shows the application of a rule (shown in the control panel 
Figure IIC) from the cluster control worksheet in Figure IIB to the data of 
worksheet "DEMO 2" as shown after coloring in FIGURES lOA-lOB. As shown 
in Figure IIC, the rule is to be applied to column B of sheet "DEMO 2". 

2. Vertical Display Re-Scaling 

The vertical re-scaling mechanism operates by taking a user-provided scale 
factor £uid then changing height of data rows to facilitate visualization of large- 
scale color patterns. The mechanism leaves column heading heights and column 
widths unchanged. This makes headings remain readable and greatly simplifies 
examining long columns of data for color patterns. 
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Figures 13A-13D show the apphcation of the vertical display re-scale 
mechanism according to the present invention. FIGURES 13 A-13B show some of 
the data m the worksheet labeled *T)EMO 3" 306 (Figure 13A shows the first 
thirty eight or so elements and Figure 13B shows the remaining elements of that 
worksheet). As can be seen from the figures, the worksheet 'TDEMO 3" 306 has 
three hundred and twenty eight (328) data entries (in rows 2-329). The user can 
vertically scale the display by selecting "Re-scale Vertical" from the system's 
special menu or by pressing a particular control key sequence (e.g., "CNTL- 
SHIFT-V" in a preferred embodiment). This presents the user with a dialog box 
318, as shown in Figure 13C, which asks the user to enter a scaling factor 
. relative to the current size. The user enters a scaling factor to enlarge or reduce or 
restore the display. In the example shown, the user enters a scaling factor of 0.1 
which produces the vertically scaled display shown in Figure 13D. 

Vertical scaling allows a user to get an overview of the data, based on the 
coloring. 

The portion of the program code presented below carries out the central 
function of the vertical di^lay rescaling mechanism: 

rowspec = "2:" & lastrow ' leaves the headings unchanged, 

i.e., readable 

If factor = -1 Then 

Rows ( rowspec ) . Rows . AutoFi t 

Else 

For irow = 2 To lastrow 

Rows { irow) . RowHeight = 

Rows (irow) .RowHeight * factor 

Next irow 
End If 
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After execution of the rescaling mechanism, as can be seen in 
FIGURE 13D, the height of each row (except the heading rows) has been scaled by 
factor, 0,1 in the example shown. In this manner, all rows of the data are made 
visible on a single page, thereby facilitating data analysis. 

3. Scoring and Sorting Clusters 

Scoring and sorting clusters assign numerical scores to the color patterns 
of individual rows or clusters of rows, thereby enabling comparison and sorting of 
the clusters by score. 

The scoring and sorting mechanism accepts user's designations of colors 
and corresponding relative scores. It handles cluster numbers or textual labels. 
Any column can be designated as the cluster labels to be processed. The 
mechanism scores a user-selected list of colimms of data, with user-assigned 
relative weights, which need not be equal for all colxunns. 

User input is taken from a Cluster Control worksheet 314 (see 
Figures llA and IIB), which stores any number of parameter sets, each one 
with a user-specified name. 

The input data is automatically sorted by cluster label before starting, in 
order to group the clusters together in case the user has previously sorted the data 
by some other criterion. Then scores are normalized to remove the effects of 
cluster size, absolute magnitude of scoring points chosen, and absolute size of 
weights chosen. The results are written to two new worksheets without altering 
the original data sheet. The first derived sheet is for the numerical scores; the 
second is like the original, but has the clusters sorted into descending score order, 
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SO that the "best" are at the top, removing the need to visually scan a long colored 
worksheet The derived output sheets have names that indicate their source data 
sheet and the name of the parameter set used for scoring. At the user's option, the 
system reversibly hides the un-scored columns in the cluster-sorted output sheet, 
focusing attention on the data that were used in scoring. 

In preferred embodiments, the system detects uncolored cells in the data 
and offers the user two programmed modes of dealing with them, (uncolored = 
entry on user's list of scores or xmcolored = "average of other colors in row"), or 
the option of stopping to color them manually. 

If the user designates a column of individual compound labels as the 
"cliister labels," then the system compares single compounds rather than clusters. 

The mechanism operates as follows, with reference to Figures IIA-IIC. 

(1) On the "Cluster Control" sheet 314 the user picks a control panel (e.g., 
panel 1100) to use (a panel which is empty or one containing non- 
needed inputs). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the top black section 1102 of the selected control panel 1100, the 
user gives this new parameter set a name if not already done. The 
name will be used to label the outputs. 

(3) In the blue section, the user enters a sheet name (in 1104) and the 
column (in 1106) to be considered as the cluster labels. Note: To score 
each compoimd separately rather than in clusters, enter a column with 
individual compoimd labels as the "Cluster Col." 
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(4) The red section of the control panel is divided into two parts, with its 
action button 1108, with red text "Score and Sort Clusters", in the 
middle. Above the button, enter the names of the colors 1110 to be 
assigned point scores, along with their corresponding point scores 
1112. The scores are arbitrary and relative; they will be normalized by 
the system as necessary. However, a user shoxild be sure always to 
assign higher point scores to colors which denote favorable values, and 
lower point scores to colors which denote unfavorable values. The 
cells with entries need not be colored, and need not be in score order, 
because the system will color and sort these cells when run. 

When assigning point values, a user should be aware that 
uncolored cells (which are most likely blank, i.e., unknown data) may 
have quality values above or below those that contain grouped and 
colored data.. The user may decide that some of the colored groups 
are "better" or *Svorse" than data being imknown, and can assign a 
score to the color ^TMONE" accordingly. 

(5) Below the "Score and Sort Cluster" button 1108, the user enters the 
columns 1114 to use for scoring (using the same syntax as for the Data 
Coloring) and their corresponding relative weights 1116. The numbers 
for weights are arbitrary and relative; they will be scaled by the system 
as necessary. Note that a line with multiple columns will assign the 
entered weight to each of the columns. 

(6) The user the selects (clicks) the red-text "Score and Sort Clusters" 
button 1108. 
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(7) When the scoring and sorting tool runs (on the system 100), if the 
system detects nncolored cells in the data, the user will be offered two 
modes of dealing with them automatically, or a third manual option of 
stopping to color them. The two modes are: 

• "Use score for the color "none" on my list" 
(RECOMMENDED) 

• "uncolored = average of other colors in row". 

(8) The program then scans the chosen columns in each row and adds up 
the chosen column's color scores for that row. These scores are then 
averaged for each cluster of rows, as defined by the user-selected 
"cluster column." The VB program code which accomplishes this is as 
follows: 

For icol = 1 To ncols 
colorcode = 

Cells (irow, colnumC icol) ) . Interior . Colorlndex 
colorfoimd = False 
' Add up the weighted scores 
For j = 1 To ncolors 

If (icolor{j) = colorcode) Then 
j score = j 
CO lor found = True 
Exit For 
End If 
Next j 

If colorcode = xlNone And treatblanks = "AVERAGED" 
Then col or found = True 

If Not color found Then 

Cells (irow, colnum (icol) ) -Select 
If colorcode = xINone Then 
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thiscname = "none" 

Else 

Call ColorNameToIndex {thiscname, colorcode. True) 
End If 

addscore « score (jscore) * colweight (icol) 
cmpdscore = cmpdscore + addscore 
' Next IF -THEN" ELSE block is 

' special calculations for the "averaged" mode 

If colorcode = xlNone Then 

lostweight = lostweight + colweight (icol) 
minscore = Application. Min(minscore, 0) 
maxscore = Application.Max (maxscore, 0) 

Else 

cmpdscore 2 = cmpdscore2 + addscore 
End If 
Next icol 

( 9 ) The scores are then normalized for the various cluster sizes (number 
of rows per cluster), and scaled to a value of one hundred (100) for a 
row which is colored entirely with the user's highest-scoring color and 
a value of zero for a row that is colored entirely with any color to 
which the user has assigned a score of zero. 



If clusterscore (icluster) = 0 Then 

' do nothing 
Elself clusterscore (icluster) > 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / (nrows * maxscore) 
Elself clusterscore (icluster) < 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / 
(nrows * (-minscore)) 

End If 
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(10)The results are presented as two newly inserted worksheets. The first 
is named by appending the word "SCORES" to the name of the 
original data sheet, and contains a list of the clusters with their sizes 
and scores. 

(1 l)The second new sheet is named by appending the word "SORTED" to 
the name of the original data sheet. The "SORTED" sheet contains a 
copy of all the original data and coloring, but with the rows re-ordered 
to place the highest-scoring clusters at the top, and all the clusters in 
descending score order from there down. 

(12)The user has two additional options regarding the appearance of the 
"SORTED" sheet: (a) a column containing the numerical scores can be 
added; and (b) the columns that were not used in the scoring can be 
hidden, so that only the ones actually used remain visible. 

An example of user provided data is shown in the control panel in 
Figure IID which is taken from die cluster control worksheet shown in 
Figure 11 A- As shown in Figure llA, the parameters are stored with the name 
"Cmpd" 1102. The scoring a sorting parameters in the control panel 1100 of 
Figure IID give the color red a score of "-1", orange has a score of "0", yellow 
has a score of "1" and light green has a score of "2". Columns C and D have 
relative weights of "1", as does column E. 

Note on the output of score and sort clusters: The system inserts two new 

sheets after the data (see, e.g., FIGURES 14C-14D). The first added sheet contains 

two score columns: the scores generated by both of the auto modes (vmcolored = 

zero and uncolored = average), but the one not selected will be gray. The scores 
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are on a scale of "-100" to "+100", where a score of "-100" means that all cells 
had the maximally negative score available, and a score of "+100" means that all 
cells had the maximally positive score available. The second added sheet has 
clusters sorted according to the one auto mode chosen when the tool ran. The 
routine offers to hide all columns that were not used in the scoring and sorting. 
The user can selectively unhide certain columns by using the "EditrGoTo" menu 
option (or typing "CTRL-G")> enter the columns in the "Reference" box (for 
example, C:F), then pick the 'Tormat:Colimm:Unhide" menu option. 

If the user wants to see a color-score-sorted list of compounds within a 
particular cluster (such as the best cluster), the user should do the following: 

1 . Sort by clusters to JBnd the ID of the cluster wanted. 

2. With a second rule, sort by compoimds. 

3. Go to the "SORTED by Compound" results sheet and turn on Excel's 
"Data:Filter; AutoFilter" feature for the column that specified the 
clustering in the first sort. The user can then choose to view only the 
compoimds in one particular cluster, and they will be in compound- 
sorted order. 

Example 

With reference to the already-colored worksheet "DEMO 1" shown in 
Figures 8A-8B, the cluster control worksheet shown in Figure 11 A, and the 
control panel shown in Figure IID, application of the scoring and sorting of 
clusters is described. As noted above, in the control panel of FIGURE IID, the 
parameters are stored with the name "Cmpd" 1102. The color red has a score of 
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"-1", orange has a score of "0", yellow has a score of "1" and li^tgreen has a 
score of "2". Coliimns C and D have relative weights of "1", as does colunm E. 

Application of control panel "Cmpd" of FIGURE IID, by selecting "Score 
and Sort Clusters", produces the worksheets shown in Figures 14A-14B. When 

5 the user selects the "Score and Sort Clusters" button 1108 for the "Cmpd" control 
panel of Figure IID, the system first presents a dialog box (1402 shown in 
Figure 14A) asking the user how un-colored cells should be scored for sorting. 
As noted above, un-colored cells can be scored explicitly by user entries 
(recommended) or as the average of the colors in the same row. Once the user 

10 makes a selection and clicks on the "OK" button, the system scores and sorts the 

data, producing the display screen shown in Figure 14B. The system provides a 
summary of what was done, including the information about the two new sheets 
("DEMO 1 SCORES by Cmpd" and "DEMO 1 SORTED by Cmpd un=ze") 
which are added to the workbook. Figures 14C-14D show the data in the newly 

15 created worksheet "DEMO 1 SCORES by Cmpd". 

Example 

With reference to the already-colored worksheet "DEMO 2" shown in 
Figures lOA-lOB, the cluster control worksheet shown in Figure llA, and the 
control panel shown in FIGURE IIC, application of the scoring and sorting of 
20 clusters is described. In the control panel of Figure 11 C, the parameters are 

stored with the name "acids" (1102). The color red has a score of "0", orange has 
a score of "1", yellow has a score of "2" and light green has a score of "3". 
Colunm D has a relative weight of "1". 
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The application of the parameters or rules in the "acids" control panel 
produces two new worksheets ("DEMO 2 SORTED by acids" and "DEMO 2 
SCORES by acids") shown in FIGURES 15A-15B. 



4. Score and Sort Dose-Response Data. 

Data grouping and visualized by color coding has also been found to 
enable an automated solution to another vexing pattern recognition problem. An 
HTS lab is currently able to provide dose-response data on some subset of the 
whole collection of compounds originally tested. Sometimes, logistical 
constraints (time and/or cost) dictate that only a few concentration points can be 
run on each compoimd, and the high-throughput nature of the process generates 
somewhat noisy data. A similar situation sometimes exists in other biological 
labomtories where assays are very time-consuming. Dose-response curves with 
few, noisy points are difficult to analyze by traditional curve-fitting methods. The 
present invention includes a mechanisms/algorithms for analyzing percent-of- 
maximal-efifect data and accurately ordering the compounds by potency, even 
when faced with few points and high noise. 

The mechanism recognizes two properties of the dose-response data for 
each compound: 

1 . "Dose-responsiveness," the drop-off of activity with dilution, is taken 
as a sign that the compound has some reasonable pharmacological 
mechanism of action. 
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2. The activity measurements at the various concentrations also provide a 
confirmation of the general level of each compound's activity that was 
indicated by the original single-poke HTS hit. 

These two properties are somewhat independent, as illustrated by the 
example of a compound that is 95% active at all tested concentrations. It 
demonstrates very poor (i.e., no) dose-responsiveness over the range of 
concentrations tested, but is so active that it should not be ignored, because it 
might reveal a dose response if tested at even lower concentrations. 

By using the data groi^ings and color codes of the dose-dependent activity 
data columns, vsdiich help to smooth out excessively fine distinctions in the 
numbers, this invention includes an algorithm to assign numerical scores for dose- 
responsiveness and overall activity in the dose-response data. Moreover, the 
algorithm also calculates a smart composite of these two scores, in such a way that 
a highly active compound will get a high composite score even if its dose- 
responsiveness is poor. This composite score is capable of extracting useful 
information, even from very noisy data, and has been validated to correctly order a 
list of test compounds. The system of this invention adds data columns that report 
all three scores for each compound, and these columns can themselves be color 
coded, and thus used in further comparison to other types of data by compound or 
cluster scoring and sorting as described above. 

Moreover, within certain limits, the invention's dose-response scoring 
algorithm can also be used to make quantitative estimates of IC50 values of 
compounds, even in the presence of large amounts of experimental error. This is 
accomplished by adding a set of hypothetical marker compounds with known 
potencies and theoretically calculated activities at the test concentrations. Since 
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the ordering algorithm is reliable, these markers will be ordered into their 
appropriate place, and can be used to calibrate the ordering scores in terms of 
actual ICso's. In other words, estimates of IC50 for the compounds can be 
generated by interpolating between the markers in the ordered list of composite 
scores. 

Scoring and sorting dose-response data according to the present invention 
processes several columns of colored dose-response data (activity vs. 
concentration) to assign three numerical scores that can later also be color coded, 
and thus used by the "Score and Sort Clusters" mechanism (described above) to 
compare compoimds or clusters of compounds. The three scores are: 

(a) degree of dose-responsiveness over the concentration 
range tested; 

(b) overall activity level demonstrated in the dose-response 
data; and 

(c) a variably weighted composite of (a) and (b), designed 
to give high scores for high activity even when dose- 
responsiveness is poor (e.g., a compound that is highly 
active at all concentrations). 

The scoring and sorting dose-response data according to the present 
invention bases its scoring on colors rather than absolute activity mmibers. The 
mechanism takes user input from a Cluster Control worksheet, e.g., as shown in 
Figures IIA-IIB. Figure IIB shows a control panel from the cluster control 
worksheet shown in Figure 11 A, wherein the user has selected columns F to I of 
worksheet "DEMO 1" for scoring dose-response. 
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The system detects xmcolored data, notifies the user, and asks whether to 
continue. Ifyes, the system skips the row containing the uncolored data. The 
system inserts three new columns on the original spreadsheet to contain the new 
scores, the new columns immediately following the columns of dose-response 
data. The column headings show the name of the parameter set used for scoring. 
Preferably, the system offers to regenerate existing table of Sheet Statistics to 
correct it for newly added score columns. Further, the system offers to sort the 
data rows by decreasing score. The system also offers to carry out quantitative 
estimates of IC50 values for the user's compounds, by adding artificial calculated 
calibration marker compoimds. 

In order to score and sort dose-response data: 

(1) Ensure that the dose-dependent activity data columns are ordered 
with highest concentration at the left and lowest concentration at 
the right. To ensure this, the system will remind the user of this 
requirement and ask him to confirm it when this tool is run. Note: 
If the data are for an imdestred effect such as toxicity, the columns 
should be ordered the opposite way (lowest concentration left, 
highest right). 

(2) Use the Data Coloring (described above) to color the dose-response 
data columns. 

(3) Go to the Cluster Control sheet 314 (Figures IIA-IID) and pick 
a control panel 1100 to use. On this sheet, a single control panel 
extends vertically through the black, blue, red, and green sections, 
and provides input information for several tools. 
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(4) In the top black section, give this new parameter set a name (1102) 
if not already done. The name will be used to label the ou^uts. 

(5) In the blue section, enter a sheet name (1104). 

(6) In the red section (1110), enter the colors used to color the dose- 
dependent data, and relative point scores (1112) to be assigned to 
these colors. 

(7) In the green section (1118), enter the columns which contain the 
dose-response data (using the same syntax as for Data Coloring). 

(8) Click the green-text "Score Dose-Response" button (1120). 

(9) If the data are expressed as "percent of maximeil effect," the user 
can follow the prompts to add calibration markers and make 
quantitative estimates of ICso's. 

Note on the ou^ut of score and sort dose-response: the system inserts 
three score columns after the dose-dependent data. The three scores are all scaled 
to a 0-100 range, and have meanings as follows: 

(a) degree of dose-responsiveness over the concentration range 
tested: 

100 = smoothly decreasing with dilution, spanning the 

entire range of color groups; 
75 = flat dose-response; and 
<75 = even more poorly behaved 

(b) overall activity level demonstrated in the dose-response data 

100 = highest activity color group at all concentrations. 
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(c) a variably weighted composite of (a) and (b), designed to give 
high scores for high activity even when dose-responsiveness is 
poor (e.g. a compound that is highly active at all 
concentrations). 



The Dose-Responsiveness Scoring Algorithm 

The data columns are ordered left to right, by decreasing concentration. 
The scoring algorithm awards positive score points for each dilution step across 
the data that actually shows a decrease in the activity data group (i.e., the color), 
and to penalize every step that does not. The algorithm uses the following 
scoring: 

• +1 point (awarded) when a dilution step moves to a lower activity group 

• 0 points when a dilution step leaves the activity group imchanged 

• -3 points (penalty) when a dilution step moves to a higher activity group 
The maximum score would go to a compound that shows all the possible 

color group steps in the right direction, and has no reversals. The minimum score 
would go to a compound with all the possible reversals, and no correct steps. The 
program then scales the extremes to 100 and 0, in order to present a consistent 
interface to the user. 

The relative magnitudes of the scoring parameters were empirically arrived 
at by testing "complete sets" of color patterns. This is possible because of the data 
simplification afforded by the value grouping. If we define the following 
numerical parameters: 

C = number of colors used, i.e., number of data value groups 
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P = number of points measured, i.e., number of different concentrations 
(doses) tested, 

then the entire **universe" of possible color patterns includes (C**) different cases. 
For some typical values that might be encountered in real HTS data, this total 
number of cases is manageable in Excel, as shown by Table 1, below. 



Table 1, Total Number of Color Patterns 



p = 


C = 


total number 


# of cone. 


# of color groups 


of 


Points 




possible cases 


3 


3 


27 


3 


4 


64 


4 


3 


81 


4 


4 


256 


5 


3 


243 


5 


4 


1024 


6 


3 


729 


6 


4 


4096 


7 


3 


2187 


7 


4 


16384* 



* For a spreadsheet with a heading row, this exceeds ExCEL's current limit (for 
EXCEL 95) by one. This value should not exceed the limit for Excel 97. 

Scoring was done on several of these complete sets, la each set, the results 
were sorted by decreasing score and compared to "intuition" for general 
correctness of ordering of dose-responsiveness, and scanned for cases deemed to 
be clearly out of order. The (+1, -3) score set was found to produce satisfying 
ordering, while lesser penalties led to poorly ordered results. More objective tests 
of ordering (described below) were then used to further validate the algorithm 

The case of P=3 and C=3 is presented below in its entirety for illustration. 
Table 2 (figure 16F) shows artificial data and processing for twenty seven (27) 
hypothetical compounds. The "percent inhibition" colmnns represent assay 
"data." If one defines three groups by breakpoints at 33% and 66%, each cell is 
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assigned to a data group as shown in the middle set of three columns. Here it is 
clear that the order of compounds in diis table is systematic (111, 112, etc.), to 
illustrate that the complete set is present The third set of three columns shows 
color coding, with the darkest being least active and the lightest being most active. 

Then the data set was processed by the system to yield dose- 
responsiveness scores, and the results sorted by this score, giving Table 3 
(Figure 16G), the complete set in order of decreasing dose-responsiveness. 
Table 3 also shows the intermediate step-scoring and unsealed score points, to 
aid in foUowing and understanding the algorithm. These points are not displayed 
by the system itself. 



The Overall- Activity Scoring Algorithm 

The second property of interest to be extracted fiom the data is the overall 
activity level exhibited by each compound. As explained above, this is largely 
independent of the dose-responsiveness. 

The data value groups* ordinal index numbers are used as single-point 
activity measures instead of the original data numbers. Extra weight is given to 
activity shown at lower concentrations by the simple algorithm of weighting each 
data column by its serial position, again ignoring the actual concentration values. 
The scores are then scaled to the range 0 to 100. The results of this scoring on the 
same complete set are shown in Table 4 (Figure 16H) which has been re-sorted 
by decreasing overall activity. 
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Comparison of Tables 3 and 4 (Figures 16G & 16H) shows clearly that 
the compound ordering by dose-responsiveness is quite different from the ordering 
by overall activity. The user (a chemist) could now color-code the new score 
columns and use them as independent factors in a larger scoring. However, 
chemists also want a single index of compound quality derived from the dose- 
dependent data. Moreover, a composite index would firrther help to alleviate the 
effects of noise on data interpretation, by incorporating more information into the 
ordering process. This is an "information-based smoothing" of the data. 
Therefore, a procedure to calculate a third, "smart composite" score from the other 
two scores was devised. 

The general idea is that when selecting good compounds from dose- 
response data, compounds showing overall high activity should not be discarded 
for lack of responsiveness. Therefore, the smart composite score should give 
more weight to the overall activity when the overall activity is high, but lower 
weight when it is low. A generalized weighted average is v^itten as 

composite score = (activity weight) (activity score) + 

(responsiveness weight) (responsiveness score) 
or, defining corresponding symbols: 

Sc = (Wa)(Sa) + (WiO(Sr) 
If the weights are normalized to sum to unity, then this becomes 

Sc = (WaXSa) + (1-Wa)(Sr) 

The activity weight Wa varies with the activity score Sa in such a way as 
to achieve the desired result. 



48 



wo 01/008039 PCT/USOO/20401 

The functional fonn of this variation was the subject of empiric8il testing. 
It was decided that the limits would be that Wa would approach 0.5 (activity and 
responsiveness equally weighted) in the limit of low activity, and that Wa would 
approach 1 .0 (responsiveness ignored) in the limit of high activity. The actual 
variation was encoded as an exponential increase in order to have rather sharp 
onset of the activity bias at higher activities: 

WA = (Ci)exp[(k)(SA)] + C2 

The value of the coefGcient k=0. 06, for which the activity bias starts to 
become substantial around an activity score of eighty (80), was chosen for 
implementation in a preferred embodiment of this invention, according to 
empirical results. Figure 161 shows the variation for a few values of Tables 
(Figure 16J) shows all three scores for the example complete set, now sorted by 
decreasing composite score. 

The details of the scoring algorithms were arrived at largely by comparing 
results to intuitive orderiag of the test cases in the complete sets. Because the sets 
were complete, no really new results can be generated by further test sets. 
However, one can generate test activity data sets from compounds of known 
potencies, whose real rank ordering is thus known, in order to see more 
objectively how well the scoring algorithms rank the results. 

To this end, a set of pseudo-ligands was hypothesized, with dissociation 
constants from a fictitious receptor ranging from nanomolar to millimolar (pK = 9 
to 3). The set included thirty one (31) compounds, with potencies evenly spaced 
by 0.2 log units (9.0, 8.8, 8.6, ... , 3.4, 3.2, 3.0). A "pseudo-screen" was created 
which **tested" binding of these ligands at five concentration points in the usual 
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range: 10"^ M, 3x10"^ M, 10"^ M, 3x10"^ M, and 10'^ M. Note that the span of 
potencies exceeds the span of concentrations tested by two log units on each end, 
so the test set includes both "very active" and "very inactive" compounds relative 
to the screening concentrations. 

Then artificial binding data vsrere created by calculation as follows. 
Assuming a simple binding equilibrium of the ligand to a receptor, the "percent 
inhibition" at a given ligand concentration is equal to the fraction of receptor sites 
which are occupied by ligand, given by simple equilibrium equations as 



Pinhib = 100 • (ligand) / [K + (ligand)] 



For a more realistic simulation, artificial random noise was then added to 
the calculated numbers. The first experiment reported below used noise randomly 
distributed over the range of ±10 inhibition percentage points, and the second 
with noise up to ±30 inhibition percentage points. Note that this means ten or 
thirty percentage points of absolute error, not 10% or 30% of the value. 

The artificial data were then color-coded according to the mechanisms of 

this invention (described above) into four color groups, using the simple 

breakpoints at 25, 50 and 75 percent inhibition. Note that in assigning these 

breakpoints^ no consideration was given to the actual data values. Then the 

scoring algorithms of this invention were run, and the compounds sorted by the 

composite score. Rank order numbers were assigned to the compounds, with 1 

being the most potent and 31 the least. In cases of ties in the composite score, 

equal rank numbers were assigned, with a value equal to the average of the rank 

numbers spanned by the tied group of compounds (e.g., a tie for 2nd and 3rd 
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would result in each compound being ranked "2.5"). For each experiment, the 
final rankings were plotted against the "real" rankings by known potency, to test 
how well the scoring algorithms ordered the compounds. These plots are shown 
in Figure 16K (for noise=10) and FIGURE 16L (for noise=30). 

For the experiment with noise up to 1 0 inhibition percentage points, shown 
in Figure 16K, the ranking of the composite scores is "perfecf (in the sense of 
having no inversions) over the range of tested concentrations (pK = 5 to 7). The 
pseudo-screen is unable to distinguish the potencies of compounds above or below 
this range. 

When the noise is much higher (30 percentage points), the ranking of 
individual compoimds is not as precise, but one can identify three cleanly divided 
"good-medium-bad" groups, as indicated by the dashed boxes on Figure 16L. 
Thus, even with this rather extreme noise level, the invention's scoring still 
successfully prioritizes the compoimds into groups. The range where 
discrimination is effective is still roughly the range of the test concentrations (pK 
= 5 to 7), but has been reduced somewhat by the higher noise. Note that the 
ranking within this range (the middle boxed group) is still mostly correct, with 
only one inversion, even for single compounds. 

Quantitative Fstimation of Potencies 

With confidence established that the algorithms provide reliable rankings 
of compounds by potency, it is possible to proceed to making quantitative 
estimates. The method uses calibration marker compounds. 

To understand this method, it is helpful to realize that the concept is 
analogous to the quantitative use of SDS polyacrylamide electrophoresis gels to 
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measure protein molecular weights. The proteins are known to migrate through 
the gel with speeds directly dependent on moleciilar weight, but it is difficult to 
calculate the absolute migration rates for a particular experiment. In dose- 
response scoring, the compounds are known to be properly ordered, but it is not 
clear how to calculate a potency (e.g., Kldiss or IC50) directly from the score. 

Protein chemists solve the molecular weight problem by running marker 
proteins, with know molecular weights, in the same gel, then using their band 
positions as calibmtion for the unknowns. Analogously, this invention's 
quantitative estimation method uses hypothetical marker compounds of know 
potency to internally calibrate the dose-response composite scores for the user's 
choice of a coloring rule, then interpolates the potencies of the unknowns. 

To create markers, the system asks the xiser to input the concentrations 
used for each of the dose-dependent activity data columns. The system then picks 
a set of calibration concentrations, at intervals of 0.5 log xmits, to span the tested 
range. For each of these calibration concentrations, a marker compound is created 
and added to the user's compound list, and artificial data is calculated for each 
colunm, from the same simple equilibrium binding equation used above in the 
validation study (this time with no "noise"): 

Pinhib = 100 • Gigand) / [K + (ligand)] 

The marker data are then colored by the same rule used for the user's 
compounds, and the scoring and sorting algorithm is re-run. 

The result is that the markers are sorted into the list according to their 
potencies, and the potencies of the other compounds can be estimated by 
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interpolating between the markers, using the composite dose-response scores. To 
illustrate, a typical section of a sorted list is shown below in Table 6 
(Figure 16M), using four colors. 

Potencies for compounds that fall between two markers are calculated by 
5 linear interpolation between the logarithms of the markers. Given the various 

uncertainties in the data values themselves £md in the evaluation process, it was 
found that linear interpolation between markers spaced at O.S log xmit intervals 
was sufficiently precise, and no more complex curve fitting was necessary. 

Validation of Quantitative Estimation 

10 Validation of the quantitative estimation method followed a procedure 

very similar to that used to validate the scoring, and using the same sets of test 
data with various noise levels. As before, the testing concentrations were from 10" 
^ to 10"'' M (negative log fix)m 5 to 7). Marker compounds (no noise) were added 
with pK's from 4.5 to 7.5, and K<iiss estimates for the noisy compoimds were 

15 carried out by the interpolation method. The results are shown below for the cases 

of 10 and 30 inhibition percentage points of noise. 

Figures 16N and 16P show that the estimates are clearly quite good 
within the range of the testing concentrations (pK 5 to 7), but the quality of 
estimation deteriorates quickly beyond those limits, and algorithm does not 

20 reliably distinguish among compounds whose potencies are more than a half log 

vmit beyond the testing range. Therefore, it was decided that presently preferred 
embodiments would not report any estimated values that fell outside the range of 
concentrations used in the testing data columns. Thus, in the example in TABLE 6 
(Figure 16M), the lowest testing concentration was 10"^ M (= 0.1 fiM). For the 
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first compound in Table 6, the system has estimated a potency with pICso > 7, 
but it conservatively only reports "<0. 1 /M." 



Table 7 summarizes the statistics of the estimations within the testing 
limits. Table 7 shows that the method successfully estimates the potencies within 
about a factor of two, even with high noise levels. 

Table 7. Statistics of Estimation Validations 



percentage 


number of 


average of 


Average error 


points 


compounds 


abs(log error) 


in ICso 


of noise 






(factor) 


10 


13 


0.22 


xl.6 


30 


13 


0.39 


x2.5 



Comparison to Other Methods of Qnantitative Estimation 

Further corroboration was obtained by treating some real data from T-cell 
proliferation blockage assays. It is estimated that these data have at least as much 
noise as the artificial test set with 30 inhibition percentage points added. The 
standard treatment of this data in the past has been to fit a dose-response ciirve 
with a Hill coefficient of 1 , using a PC-based program ORIGIN. (ORIGIN is a 
data analysis program from Microcal Software, Inc. of Northampton, 
Massachusetts. ORIGIN is used in this instance for non-linear least-squares 
fitting of dose-response curves to functional equations.) 

The data used here were from testing in the concentration range from 1 to 
0,03 jlM (negative log from 6 to 7.5). The plot in Figure 16Q shows the 
correlation of values estimated by this invention with values from ORIGIN fits. 
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Two compounds that the present invention estimates to be beyond the testing 
range, i.e., pICso below 6, are included as open diamonds, for illustrative purposes 
explained below. (As explained above, preferred embodiments of this invention 
would normally not report these values.) 

5 The results are consistent with the properties observed in the validation 

study. The present invention estimates are quite good within the testing range (6 
to 7.5). At the lower limit, this invention has made two estimations exactly at 10"^ 
M (arrows) which do not correlate as well with the ORIGIN fits. Nevertheless, 
because the whole plot spans only a relatively narrow range of potency, even these 

10 discrepancies are not very large. For all twenty eight (28) estimates within the 

testing range (including these two), the average logarithmic deviation between the 
two estimation methods is 0.18, corresponding to a factor of only 1.5. 

It is further noted that the calibration marker estimation method does not 
uniformly "flatten ouf* beyond the testing concentration range. The two open 

15 diamonds in Figure 16Q are estimations that presently preferred embodiments of 

this invention would not normally report because they have pICso < 6, but they 
agree well with ORIGIN fits 

It is interesting to compare calibration-maAer with curve-fitting results for 
particularly badly behaved data, such as dose-response curves that are not 

20 monotonic with respect to concentration. This is sometimes the type of data that 

emerges firom dose-dependent screening in a high-throughput mode. Studies of 

this type have been initiated by adding artificial noise to the extent of fifty (50) 

inhibition percentage points. 

Finally, it should be pointed out that there is some mechanical advantage 

25 to using the present invention relative to current practice of using ORIGIN. 
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ORIGIN is used by manually filling in a template with data, then manually 
executing a fit. Depending on the number of points and the degree of 
customization of parameters, this can take one to ten minutes of the user's time. 
The present invention, on the other hand, processes a whole spreadsheet at once 
(i.e., up to 16,383 compounds), and goes at a rate of about 3,000-4,000 
compounds per minute on a 200 MHz PC. 

Examples 

Figure 16A shows dose response data for twenty (20) compounds at four 
concentrations. The data have been grouped and the cells colored by the rule 
shown in FIGURE 16B. The result of the scoring and sorting process is shown in 
Figure 16C, where the compounds are ordered by decreasing values of the 
composite score (column H). Then, virtual "marker" compounds are added with 
known potencies spaced by 0.5 log units, and they are shown in Figure 16D, 
colored by the same rule and scored. The name of each marker compound 
designates the logarithm of its potency, e.g., "marker_7.0" has a potency IC50 — 
10'^ M. Figure 16E shows the result of sorting the list by decreasing composite 
score after adding the markers. This process then enables estimation of IC50 
values for the compounds by interpolating in the column (H) of ordered composite 
scores, and these estimates appear in two forms in colxmms I and J. 

6. Summarize Spreadsheet Statistics Mechanism 

This mechanism creates a table summarizing the entries in each colimm of 

a data sheet, to aid the user in deciding how to color each column. The 
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mechanism counts numeric, text, and data entries, and uses color to flag colunms 
that have mixed types. The mechanism also counts blanks, and specially flags 
colunms with "trailing blanks," i.e., columns which are shorter than the longest 
one on the spreadsheet. For numeric data, the mechanism calculates minimum, 
maximum, mean, and standard deviation, even in the presence of interspersed text 
entries. For text data, the mechanism presents a list of the text strings used and 
their occurrence counts. The mechanism creates a summary key of the column 
letters and headings as a text box that can be copied to other sheets for convenient 
reference. 

Figure 17A shows a sample spreadsheet containing miscellaneous data 
on twenty four (24) compounds. Figure 17B is the statistics sheet calculated 
from it. Each row of the statistics sheet describes one column of the original data 
sheet. First, the counts of numeric, text, date, and blank entries are listed, 
followed by two columns describing the total length of the data sheet. Then the 
minimum, maximum, mean, and standard deviation of any numeric data are 
reported. Finally, the statistics sheet lists a simunary of the text strings foimd in 
each original data column. As examples, in Figure 17B, one can see that 
original colunm A ("Cmpd") had twenty four (24) different text strings, that the 
numeric data in original column C ("Testl") had a mean of 2.385, and, flagged by 
the red coloring, that original data column E ("Tests ") had a mixture of ten 
numeric data and two text strings, both '7»I.A." 

The details of how the program code accomplishes this are 
straightforward, and one of ordinary skill in the art would know, from this 
description, including the Figures, how to make and use this invention. The 



57 



wo 



01/008039 PCTAJSOO/20401 

program loops through all the entries in the coliimn, testing the data type of each, 
and tallying the counts and numerical statistics. 



Spreadsheet Creation and Oi^anization 

The operations of this invention require a considerable amoimt of user 
input, e.g., to create well-structured spreadsheets, to define and apply diverse 
coloring rules for large numbers of columns, and to use these colors and the user's 
stated scientific priorities to create meaningfully ordered lists of compounds or 
clusters. 

The user interface of this invention has been designed to ease this process 
and help the scientist focus on the tasks of formulating and recording clear 
descriptions of the evaluation parameters. Accordingly, this invention provides a 
number of tools and mechanisms to aid in the creation and organization of 
spreadsheets. These tools and mechanisms include: 

• Smart Append Column Mechanism 

• Merge Data Mechanism 

• Data Import Mechanism 

• Workbook Navigation Shortcuts 

• Conversion of "uM" to )iM and '*UU" to jx 

• Delete Pictures Mechanism 

• Change Values in Column Mechanism 

• Concatenate Values across Columns Mechanism 

• Delete Leading Inequality Signs Mechanism 

• Delete Derived Sheets Mechanism 
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Smart Append Column 

This mechanism appends new columns of data onto an existing 
spreadsheet, matching rows by labels (e.g., compoiind numbers). The mechanism 
copies all data to a new sheet before doing its work, leaving the original sheets 
unchanged. There is no need for the user to pre-sort any of the data. The 
mechanism provides optional case-sensitive or case-insensitive label matching. 

New rows are added at the bottom when new labels do not match any old 
labels. Rows with missing labels are identified and the system ofifers to fill them 
by copying previous label. Rows with repeated labels (i.e., replicate data) are also 
identified and the system ofifers a choice from among several automated 
processing rules, or manual fixing. A fast matching algorithm temporarily sorts 
rov^ by label, then restores original order when finished. Several intermediate 
stopping points are offered and extra data viewing options for conservative users 
worried about errors. 



Merge Data Mechanism 

The merge data mechanism copies new data values from an appended 
column into an older column. The mechanism copies all data to a new sheet 
before doing its work, leaving the original sheets unchanged. The mechanism 
detects cells where new data would overwrite old data that is different, flags them 
with color, and alerts the user. Several intermediate stopping points are offered to 
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the user, as are extra data viewing options are oiFered for conservative users 
worried about errors. 



Data Import 

One-button (or one-menu-click) import of existing EXCEL spreadsheets 
into an integrated file, which contains both the data and the related control sheets. 
The mechanism offers to search for and remove any leading or trailing spaces in 
the imported data and offers to consolidate replicate data rows into xmique ones, 
using usGT choices as to how to handle the replicate data. The mechanism also 
detects hidden rows and offers to unhide them and detects formulas and offers to 
convert them to values. This mechanism is also used to update to newer version 
of the system. 

Workbook Navigation Shortcuts 

The system includes various workbook navigation shortcuts including: 

• A special added drop-down menu which includes commands for jimiping 
durectly to the various control sheets. These commands also have keyboard 
shortcuts assigned to them. 

• From a cell on a control sheet that contains the name of a data sheet, a 
special item on the right-click shortcut menu jumps directly to that data 
sheet Other special items on this menu enable a "Twin Screen" display to 
see two sheets at once. 
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• To aid in choosing columns to enter on control sheets, there is a special 
"Twin Screen" display triggered by rigjit-clicking any cell on a control 
sheet that contains the name of a data sheet. 



ConvCTt to uM and "UU" to u 

Preferred embodiments of the system of this invention require the data 
spreadsheet to have one and only one row of column headings. The user can type 
either of the encoded strings '^iM" (lowercase u, uppercase M) or "UU" (both 
uppercase) into any column heading, select the cell or whole row of headings, then 
pick this command. Each **uM" in the selection will be converted to "^iNT*, and 
each 'TTU'* will be converted to a "^i". The code recognizes the special exception 
of the word "VACUUM" as long as it doesn't end with the cases "uM." This 
conversion allows the user to avoid the confusing use of lowercase "u" or the 
colimm-widening use of the fall prefix "micro." This utility appears on the 
system menu. 

Delete Pictures 

The system provides a mechanism for removing pictures containing 
chemical structures, in order to reduce file size, processing time, and confiision 
when they do not align properly after row sorting. 
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Change Values in Column 

This is a mechanism for regularizing data in a spreadsheet column. It 
facilitates replacement of all occvirrences of a given value by another. The 
mechanism creates backup copies of the original column, and updates any existing 
data statistics for the edited sheet. 

Concatenate Values across Columns 

The system provides a mechanism for regularizing data in a spreadsheet 
column. Some possible uses include: (a) construction of unique row labels: M- 
number plus stroke number -> "Ml 23456/001"; and (b) reconstitution of 
numerical inequalities from separate columns: plus a nxmiber — > ">number". 
The user is provided with an option to include linking (delimiting) text strings 
between values and an option to include or skip blanks. The system retains the 
original columns and inserts a new one for the results. 

Delete Leading Inequality Signs 

Another mechanism for regularizing data in a spreadsheet column includes 
the mechanism to delete leading inequality signs. This mechanism converts 
entries like ">1000" to just the number "1000". This must be used with 
considerable caution, because it is the equivalent of creating a false test result. It 
is generally preferable to color the cells containing text strings with the data 
coloring mechanism described above, rather than alter them. All later processing 
is based on the colors, not the cell values. This mechanism also deletes inequality 
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The mechanism creates backup 



Delete Derived Sheets 

The system menu includes a command to delete all output sheets from the 
current workbook, with separate user confirmation for each one. This is intended 
as a cleanup mechanism for information that may be outdated and is easily 
regenerated by subsequent system runs. 

Initial experience with the coloring tool has revealed that color coding has 
more subtle, but far-reaching usefulness. The colors themselves also can act as a 
"currency of exchange," a medium for comparing the quality of one kind of result 
to the quality of a very different kind of result. For example, an HTS activity of 
"95% inhibition" may be considered desirable and color coded, e.g., green. In the 
same list of compounds, a molecular weight between 400 and 600 may be consid- 
ered optimally desirable, and thus also color-coded green. If the user takes care 
when assigning colors, "green" takes on a common meaning across the board. 
This translation of data values into colors then opens up a cornucopia of 
possibilities for processing the colors (as numerical color indices) and comparing 
compounds, searching, in our example, for the ones that are the "most green." 

Accordingly, the system includes tools to numerically score individual 
compoimds or clusters of compounds by the colors that appear in their various 
data columns. The system can then create a new spreadsheet sorted by this score 
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(either by single compoimds, or cluster-by-cluster, the choice being the user's), in 
which the "most green" compounds will then appear at the top. 



Examples 

Application to Portfolio Management 

The system can equally well be applied to any set of data where the rows 
are cases of a similar construct, with the columns being various properties of each 
case. For example, a data spreadsheet can contain a list of competing projects or 
investments for a company's portfolio, with the colurons containing various 
managers' ratings of each project or investment. Figure ISA shows an example 
of twenty projects, each of which has been scored 1, 2, or 3 on two factors, one 
more important than the other, by each of three managers. The sheet has been 
colored by the rule shown in FIGURE 18B. Then the data were scored and sorted 
by the sorting rule of Figure 18C, and the result is shown in Figure 18D. 
Clearly, the projects that were given a "3" in the important factor come to the top, 
and it can be seen that the less important factor does indeed matter less to the final 
ordering. The colors also help to flag anomalies, such as a low score by one 
manager on an otherwise high-ranking project. 

In general, the data can be various sorts of data. Some examples are listed 
below and illustrated in the referenced Figures. 

Figure 19 shows a list of drug candidate compoimds, scored and sorted 

by a composite of ten parameters that describe their physical, chemical, and 

biological properties. Green shades indicated desirable values; red shades are 
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undesirable. The display is compressed vertically with the vertical re-scaling tool 
to clearly display the difference in coloring patterns between the top eighty (80) 
compounds and the bottom eighty (80) compounds (separated in the illustration by 
a blank band). 

Figure 20 shows a list of proteins that are candidates for targets for drugs, 
chosen from a pool of candidate genes, scored and sorted by a composite of eleven 
parameters that describe their suitability. 

FlGUR£ 21 shows a list of research projects competing for resources. Each 
project has been scored according to several evaluation factors, and the whole 
array has been sorted by color groups. The same construct is useful for evaluating 
employee performance or job candidates. 

Figure 22 shows a list of phamiaceutical companies and their current 
status with regard to discovering or marketing products in each of various disease 
areas. Each company's line of products has been scored according to the maturity 
of the offerings, and the whole array has been sorted by color groups. 

Figure 23 shows the use of data-grouping (coloring) rules to visualize the 
time courses of drug concentrations in blood. In this example, light colors were 
chosen to represent high concentrations of drug in the blood, while dark colors 
were chosen to represent low concentrations. The figure shows a wide range of 
differing time courses. 

Figure 24 shows the use of data-grouping (coloring) rules to visualize the 
matrix of pairwise cross-correlations of the results of eight (80) drug screens. Li 
this example, Ught colors were chosen to represent low correlations, while dark 
colors represent high correlations. 
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Quantitation of the Similarity of Data Grouping in Two Variables 
As part of the present invention, a mechanism is provided for assigning a 
quantitative measure to the degree of similarity of grouping (visualized by color 
coding) of data in each of two columns of an Excel spreadsheet. The mechanism 
allows for a correlation-like analysis on a wide variety of data types, including 
text, or mixed numbers and text. 

In the data-exploration paradigm of the present invention, one of the first 
steps a user takes is to divide the range of data values in each column into a small 
niimber of groups for finrther analysis, thus effecting a reduction of precision 
which has been found to be useful in various ways. 

It is sometimes useful to explore whether the rows of the data matrix have 
been divided into similar groupings in each of two different colunms. For 
example, a researcher might ask, "Do the high molecular weight compounds tend 
to be the ones whose solubilities fall below the limits of measurement?" In other 
words, this would mean to compare the groupings in the molecular weight column 
with the groupings in the solubility column. 

If the data were strictly quantitative, this would be called correlation of 
variables, and there exist a number of perfectly good statistical measures of the 
phenomenon. However, one of the imique capabilities of the present invention 
lies in dealing with textual data and mixtures of nxmibers and text, and it would be 
helpiiil if one could translate the visible color patterns of the present invention to 
some kind of quantitative measure of correlation. In order to avoid confusion with 
standard statistical correlation, the distinct term "color grouping similarity" is used 
to describe the new measure. 
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In preferred embodiments of this invention, the data grouping is stored in 
the form of the colored backgrounds of data cells. At first glance, one might 
simply seek to compare the colors of the first colunm with those of the second, 
and count the number of rows with matching colors. However, a color grouping 
similarity tool must be able to cope with the possibility that the colors are 
different. This could happen because the user chooses completely different color 
schemes for the two columns, or because the correlation is negative. As an 
example of negative correlation, suppose colimrn A contains random numbers 
between 0 and 1, colored such that those above 0.5 are green and those below 0.5 
are red. Then imagine a column B where each value is equal to one minus the 
corresponding value in column A, i.e., the "one's complement." If the user colors 
the second colunm with exactly the same coloring rule as the first, every row will 
have a different color in column B than in colimm A. None of the colors will 
match, though the groupings are perfectly correlated. To be successful, the tool 
must deliver a high measure of correlation between such pairs of colvmms. The 
algorithm described below was designed to perform in this way. 

Algorithm for Measuring Data Grouping Similarity 
The algorithm was derived firom semi-quantitative reasoning, as follows. 
It is based on the qualitative question, "For all rows that have one particular color 
in the first column, to what degree do they have a uniform color in the second 
column (not necessarily the same color as in the first column)?" The quantified 
answer to this question is then averaged over the set of colors used. 
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The details of the mechanism can be seen by example. First, to compare 
the grouping in two columns A and B, a matrix of "ordered color pair counts" 
(OCPC) is defined such that each matrix element OCPCy is the count of 
spreadsheet rows where one finds color / in spreadsheet colunm A and color j in 
5 spreadsheet colimm B. Then, the rows of the two spreadsheet columns are 

scanned to count the number of occurrences of each ordered color pair and thus to 
determine the values of the matrix elements. 

In the discussion below, carefully distinguish the rows and columns of the 
user's spreadsheet of data from the rows and columns of the derived OCPC 
10 matrix. 

If the two spreadsheet columns had exactly the same colorings the nonzero 
elements of the OCPC matrix would all be on the diagonal. As a simplified 
example, consider four colors (green, yellow, orange, red) and a total of 16 data 
rows. The diagonal matrix might be (zero elements left blank for emphasis) 

15 



color in column A color in column B ~> 


green 


yellow 


orange 


red 


green 


4 








yellow 




2 






orange 






7 




red 








3 



In the case above, there are groups of 4 spreadsheet rows colored green (in 
both columns), 2 rows colored yellow, 7 rows colored orange, and 3 rows colored 
red. 

20 In contrast, if the groupings were the same, but the coloring rule for the 

second spreadsheet colimm used the same colors in a different order, the OCPC 
matrix might look like the following, no longer diagonal: 
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color in column A color in column B -» 


green 


yellow 


orange 


red 


green 






4 




yellow 


2 








orange 








7 


red 




3 







A simple extension applies if entirely different colors (cyan, blue, maroon, 
purple) are used in the second spreadsheet colunm. The OCPC matrix might then 
be: 



color in column A color in column B —> 


cyan 


blue 


maroon 


purple 


green 






4 




yellow 


2 








orange 








7 


red 




3 







In any of the three cases above, the groupings are identical, and the OCPC 
matrices have the property that each matrix row and matrix column has only one 
nonzero element. That lone element is necessarily equal to the simi of the row or 
column. This situation should receive the highest similarity score. 

One way to define the contrasting situation that would deserves the lowest 
similarity score would be that for each user-defined group of spreadsheet rows in 
one spreadsheet column, the other spreadsheet column has a ""maximEilly non- 
unique" set of colors. In the corresponding OCPC matrix, this corresponds to 
each matrix row or colimin having a broad distribution of values rather than a 
single non-2«ro, a uniform distribution has been chosen as the definition of this 
state: 
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color in column A color in column B — > 


cyan 


blue 


maroon 


purple 


green 
yellow 
orange 
red 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 



This low-similarity state can be more precisely defined by saying that each 
element in a given matrix row or column is the averse of all the coimts in that 
matrix row or colimm. 

With these concepts defined and the OCPC matrix filled, the scores can 
then be derived. Each OCPC matrix row (corresponding to a color group in the 
first spreadsheet column) is selected in turn for scoring. Each element in the 
matrix row is given a score between zero and one, according to its linear 
interpolation between: on the one extreme, the average of the nonzero elements in 
the row, and on the other, the sum of the row or column (i,e., the maximum value 
it could have if all the others were zero). The scores are then averaged over all the 
rows of the OCPC matrix to generate a row-wise score component. 

Next, the corresponding process is applied to the columns of the OCPC 
matrix (each corresponding to a color group in the second spreadsheet column 
rather than the first). The resulting column-wise score component is averaged 
with the row-wise score component, then the average is scaled to a maximum of 
100 to generate the final similarity score for the two spreadsheet columns. 

Interpretation of the Similarity Scores 

Although the scores are qxiantitative and well-defined, their interpretation 
is best done in a partly subjective manner, based on experience. The behavior of 
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the scores is best understood by example. In Figure 25, the leftmost ("base") 
colxmm has been compared to each of the others, and the scores are shown as well 
as pictures of the grouping patterns. Comparison of the base with the next column 
shows that the tool delivers a maximal score of 100 for identical grouping, even 
when the colors are completely different. Then, stepping across the figure toward 
' the right, it can be seen how the score decreases as the grouping pattern gradually 
becomes less similar to that of the base colimin. All the way down to a similarity 
score of 40, it is still basically true that the Ught colors are on top and the dark on 
the bottom, with increasing "noise," but when the score falls to 20, the pattem 
appears to have no correspondence to that of the base. 



Implementation of the Data Grouping Similarity Tool 

Li practice, the tool allows the user to choose two sets of spreadsheet data 
columns. The program then automatically generates all pairs containing a colmnn 
firom the first set with a column firom the second set, then writes the similarity 
scores onto a newly inserted spreadsheet in the user's workbook. The output takes 
the form of a table where the degree of similarity is itself color-coded to aid the 
user in identifying significant cases. An example appears FIGURE 26. 



While the invention has been described with reference to particular 
mechanisms (algorithms, processes and fimctions) and architectures, one skilled in 
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the art would realize that other mechanisms and/or architectures could be used 
while still achieving the invention. 

While embodiments of the present invCTtion have been described with 
particular setup and initialization procedures, other setup and/or initialization 
procedures can be used. 

Further, while many of the operations have been shown as being 
performed in a particular order, one skilled in the art would realize that other 
orders, including some parallelization of operations, are possible and are 
considered to be within the scope of the invention. 

While the present invention has been described with reference to analysis 
and pattern recognition in data sets relating to chemical compounds, the methods, 
systems and devices of this invention are considered to be general constructs 
covering other, non-chemical data sets. 

Thus, are provided methods, systems and devices for analysis and pattem 
recognition in large, multidimensional data sets using low-resolution data 
grouping. One skilled in the art will appreciate that the present invention can be 
practiced by other than the described embodiments, which are presented for 
purposes of illustration and not limitation, and the present invention is limited 
only by the claims that follow. 
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What is claimed is: 

1 . A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable number of groups; and 

applying at least one of the grouping rules to the data. 

2. A method as in claim 1 herein the data are provided in a table and 
wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

3. A method as in claim 1 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints, 

4. A method as in claim 1 fiirther comprising: 

presenting the grouped data in a maimer that visually distinguishes the 

groups. 

5. A method as in claim 4 wherein the grouping rules associate colors 
with groups and wherein the presenting of the grouped data fiirther comprises: 

coloring an aspect of the data according to the rules. 
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6. A method as in claim 4, wherein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled column of the data based 
on the at least one breakpoint and the corresponding color of the at least one 
breakpoint. 

7. A method as in any one of claims 3 and 6, wherein the breakpoints 
are selected from: (a) numeric values; and (b) textual values. 

8. A method as in claim 3 wherein the at least one breakpoint is 
determined automatically based on the data. 

9. A method as in claim 5 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgrounds of table cells according to the rules. 

10. A method as in claim 1 wherein the number of groups is fewer than 
a niunber of possible data values. 

11. A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable number of groups; 
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applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 

12. A method of operating on data, the method comprising: 
generating grouped data by applying to the data at least one user-defined 

grouping rule for grouping the data into a user-definable number of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defined scoring rule for scoring the grouped data according to user-defined scores. 

13. A method according to claim 1 1 or 12 wherein the data comprises a 
number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the fimction to the data to obtain a score for each case. 

14. A method according to claim 13, finrther comprising: 
sorting the scored cases by score. 

15. A method according to claim 14, wherein the scored cases are 
sorted individually. 
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16. A method according to claim 14, wherein the scored cases are 
sorted by cluster. 

17. A system for operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; and 

a mechanism constructed and adapted to apply at least one of the grouping 
rules to the data. 

18. A system as in claim 1 7 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

19. A system as in claim 17, wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

20. A system as in claim 17, fiirther comprising: 

a mechanism constructed and adapted to present the grouped data in a 
manner that visually distinguishes the groups. 
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21 . A system as in claim 20, wherein the grouping rules associate 
colors with groups and wherein the mechanism constructed and adapted to present 
the grouped data further comprises: 

a mechanism constructed and adapted to color an aspect of the data 
according to the rules. 

22. A system as in claim 20, v^erein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breaIq}oint and a corresponding color for each at least one breakpoint, and wherein 
the mechanism constructed and adapted to present the grouped data comprises: 

a mechanism constructed and adapted to color each data item in the at least 
one labeled column of the data based on the at least one breakpoint and the 
corresponding color of the at least one brealq>oint. 

23 . A system as in any one of claims 1 9 and 22, wherein the 
breakpoints are selected from: (a) numeric values; and (b) textual values. 

24. A system as in claim 19 further comprising: 

a mechanism constructed and adapted to detennine at least one breakpoint 
automatically, based on the data. 

25. A system as in claim 21 wherein the data are provided in a table, 
wherein the mechanism constructed and adapted to color an aspect of the data 
comprises: 



77 



wo 01/008039 PCTAJSOO/20401 
a mechanism constructed and adapted to color backgrounds of table cells 
according to the rules. 



26. A system as in claim 17 wherein the number of groups is fewer 
than a number of possible data values. 

27. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; 

a mechanism constructed and adapted to apply at least one of the grouping 
rules to the data to generate grouped data; 

a mechanism constructed and adapted to provide at least one user-defined 
scoring rule for scoring the grouped data according to user-defined scores; and 

a mechanism constructed and adapted to apply at least one of the scoring 
rules to the grouped data to score the grouped data. 

28. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to generate grouped data by 
applying to the data at least one user-defined grouping rule for groupmg the data 
into a user-definable number of groups; and 

a mechanism constructed and adapted to score the grouped data by 
applying to the grouped data at least one user-defined scoring rule for scoring the 
grouped data according to user-defined scores. 
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29. A system according to claim 27 or 28 wherein the data comprises a 
number of parameters for each of a nimiber of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 

5 mechanism constructed and adapted to score of the grouped data comprises: 

a mechanism constructed and adapted to apply the function to the data to 
obtain a score for each case. 

30. A system according to claim 29, further comprising: 

10 a mechanism constructed and adapted to sort the scored cases by score. 

31. A system according to claim 30, wherein the scored cases are 
sorted individually. 

15 32. A system according to claim 30, wherein the scored cases are 

sorted by cluster. 

33 . A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
20 method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable number of groups; and 

applying at least one of the grouping rules to the data. 
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34. A medium as in claim 33 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

35. A medium as in claim 33 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

36. A medium as in claim 33, wherein the method further comprises: 
presenting the grouped data in a manner that visually distinguishes the 

groups. 

37. A medium as in claim 36 wherein the groupmg rules associate 
colors with groups and wherein the presenting of the grouped data further 
comprises: 

coloring an aspect of the data according to the rules. 

38. A medium as in claim 36, wherein the data are in labeled columns 
in a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled column of the data based 
on the at least one breakpoint and the corresponding color of the at least one 
breakpoint. 
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39. A medium as in any one of claims 35 and 38, wherein the 
brealcpoints are selected from: (a) numeric values; and (b) textual values. 

40. A mediimi as in claim 35 wherein the at least one brealqjoint is 
determined automatically based on the data. 

41 . A medium as in claim 37 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgroimds of table cells according to the rules. 

42. A medium as in claim 33 wherein the number of groups is fewer 
than a number of possible data values. 

43. A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable number of groups; 

applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 
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44. A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

generating grouped data by applying to the data at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defined scoring rule for scoring the grouped data according to user-defined scores. 

45. A medium according to claim 43 or 44, wherein the data comprises 
a number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the function to the data to obtain a score for each case. 

46. A medium according to claim 44, the method further comprising: 
sorting the scored cases by score. 

47. A medium according to claim 46, wherein the scored cases are 
sorted individually. 

48. A medium according to claim 46, wherein the scored cases are 
sorted by cluster. 
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